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Abstract 

This  thesis  describes  a  vision-based,  large- area,  simultaneous  localization  and  mapping 
(SLAM)  algorithm  that  respects  the  low-overlap  imagery  constraints  typical  of  autonomous 
underwater  vehicles  (AUVs)  while  exploiting  the  inertial  sensor  information  that  is  routinely 
available  on  such  platforms.  We  adopt  a  systems-level  approach  exploiting  the  complemen¬ 
tary  aspects  of  inertial  sensing  and  visual  perception  from  a  calibrated  pose-instrumented 
platform.  This  systems-level  strategy  yields  a  robust  solution  to  underwater  imaging  that 
overcomes  many  of  the  unique  challenges  of  a  marine  environment  (e.g.,  unstructured  ter¬ 
rain,  low-overlap  imagery,  moving  light  source). 

Our  large-area  SLAM  algorithm  recursively  incorporates  relative-pose  constraints  using 
a  view-based  representation  that  exploits  exact  sparsity  in  the  Gaussian  canonical  form. 
This  sparsity  allows  for  efficient  0(n)  update  complexity  in  the  number  of  images  compos¬ 
ing  the  view-based  map  by  utilizing  recent  multilevel  relaxation  techniques.  We  show  that 
our  algorithmic  formulation  is  inherently  sparse  unlike  other  feature-based  canonical  SLAM 
algorithms,  which  impose  sparseness  via  pruning  approximations.  In  particular,  we  investi¬ 
gate  the  sparsification  methodology  employed  by  sparse  extended  information  filters  (SEIFs) 
and  offer  new  insight  as  to  why,  and  how,  its  approximation  can  lead  to  inconsistencies  in 
the  estimated  state  errors.  Lastly,  we  present  a  novel  algorithm  for  efficiently  extracting 
consistent  marginal  covariances  useful  for  data  association  from  the  information  matrix. 

In  summary,  this  thesis  advances  the  current  state-of-the-art  in  underwater  visual  navi¬ 
gation  by  demonstrating  end-to-end  automatic  processing  of  the  largest  visually  navigated 
dataset  to  date  using  data  collected  from  a  survey  of  the  RMS  Titanic  (path  length  over 
3  km  and  3100  m2  of  mapped  area).  This  accomplishment  embodies  the  summed  contribu¬ 
tions  of  this  thesis  to  several  current  SLAM  research  issues  including  scalability,  6  degree  of 
freedom  motion,  unstructured  environments,  and  visual  perception. 

Thesis  Supervisor:  Hanumant  Singh 
Title:  Associate  Scientist,  WHOI 

Thesis  Co-Supervisor:  John  J.  Leonard 
Title:  Associate  Professor,  MIT 


3 


4 


Acknowledgments 


Now  that  I’m  just  about  to  stop  drinking  from  the  fire  hose,  I  realize  both  what  a  tremendous 
opportunity  and  experience  I’ve  had  here  in  my  n-plus  years  in  the  MIT/WHOI  joint  program 
(where  it  oftentimes  seemed  like  n  was  closer  to  oo  than  to  0).  The  fact  that  I  made  it  to 
this  point  is  due  to  the  help  and  influence  of  many  others. 

First,  I’d  like  to  thank  my  advisor,  Hanu,  who  taught  me  that  the  answer  to  the  ultimate 
question  of  life,  the  universe,  and  everything  is  42,  that  all  people  are  cordially  referred 
to  as  “bubba”,  and  that  Pink  Floyd’s  “Dark  Side  of  the  Moon”  is  the  best  album  ever 
(actually  I  already  knew  that  one,  but  it  was  reassuring  to  find  out  that  my  advisor  was 
on  the  same  wavelength).  I  also  know  that  during  my  tenure  here  I  was  responsible  for  a 
number  of  gray  hairs  appearing  on  his  head  —  to  which  I  say  both  thank  you  and  you’re 
welcome.  In  addition,  I’d  like  to  acknowledge  my  appreciation  for  all  of  the  opportunities 
and  experiences  he’s  given  me  (research  cruises,  conferences/ workshops,  India,  SeaBED),  as 
well  as  for  helping  me  to  see  this  through  to  the  end. 

Next,  I’d  like  to  thank  my  committee:  John  Leonard,  Seth  Teller,  and  Louis  Whitcomb; 
they  have  provided  valuable  guidance  and  resources  along  the  way.  Specifically,  to  Louis 
for  helping  me  carry  out  my  validation  experiments  at  the  JHU  test  facility,  to  Seth  for  his 
meticulousness  and  for  his  help  in  defining  my  niche,  and  finally  to  my  co-advisor  John  for 
imprinting  upon  me  his  useful  idioms  about  graduate  school:  1)  “Must,  Should,  Would  be 
nice”,  2)  “you’re  number  one  goal  is  to  get  out”,  and  3)  “getting  a  PhD  is  about  having  a 
conversation  with  the  literature.” 

Now  that  those  formalities  are  out  of  the  way,  it’s  time  to  thank  the  people  who  actually 
toiled  with  me  in  the  trenches,  and  from  whom  I  probably  learned  the  most  from.  To  fellow 
advisees  Chris  Roman  and  Oscar  Pizarro,  what  a  strange  long  trip  it  has  been  —  I  think 
you’d  be  hard  pressed  to  find  three  other  guys  that  could  have  pulled  it  off  better  as  a 
team  than  we  did.  I’m  glad  for  all  of  the  discussions  about  research  and  life,  hanging 
out  on  movie  nights  with  beer,  and  just  generally  make  this  an  enjoyable  experience.  I’d 
also  like  to  thank  Matt  Walter  for  his  immense  help  in  carrying  out  ideas  on  SEIFs  (along 
with  the  cappuccinos),  and  the  same  goes  for  James  Kinsey  who  helped  me  make  the  JHU 
experiment  a  success  (the  first  round  of  Guinness  is  on  me).  Thanks  also  to  Neil  McPhee 
for  his  expertise  in  “making  things  work” ,  to  the  scientists  and  staff  of  DSL  for  lending  any 
help  they  could  to  students,  to  Marsha  and  Julia  and  everyone  else  in  the  Education  Office 
for  their  amazing  support,  and  to  fellow  cronies  Mike  Jakuba,  Brian  Bingham,  Ali  Can, 
Brendan  Foley,  and  Rich  Camilli  for  always  making  time  to  bounce  ideas  off  of. 

Finally,  I’d  like  to  thank  my  family  for  all  of  the  love  and  support  they  have  given  me 
through  the  years.  To  my  sisters  Janae,  Charla,  and  Jessy  for  always  making  me  laugh,  to 
my  grandparents  Gerald  and  Janet  for  being  such  an  important  part  of  my  life,  and  to  my 
mother  Jeannie  for  inspiring  me  to  dream  big  and  see  the  world.  Lastly,  but  definitely  first, 
thank  you  to  my  8  month  old  son,  Noah,  who  is  a  source  of  tremendous  joy  in  my  life,  and 
to  my  wife,  Karen,  who  is  my  better  nine-tenths  and  has  provided  support  beyond  words 
in  helping  me  achieve  this. 

thank  you  all... 
ryan 


5 


This  work  was  funded  in  part  by  the  CenSSIS  ERC  of  the  National  Science  Foundation 
under  grant  EEC-9986821,  in  part  by  the  Woods  Hole  Oceanographic  Institution  through  a 
grant  from  the  Penzance  Foundation,  and  in  part  by  a  NDSEG  Fellowship  awarded  through 
the  Department  of  Defense. 


6 


CONTENTS 


1  Introduction  19 

1.1  Motivation  .  19 

1.2  A  Review  of  Underwater  Computer  Vision .  23 

1.2.1  Image  Registration .  25 

1.2.2  Mosaicking .  29 

1.2.3  Multiview  Geometry .  32 

1.2.4  Imaging  Constraints  of  the  Underwater  Environment .  35 

1.3  A  Review  of  Underwater  Vehicle  Navigation .  38 

1.3.1  Long  Baseline  Navigation .  38 

1.3.2  Doppler  Velocity  Logs .  39 

1.3.3  Visually-Based  Methods .  41 

1.3.4  Limitations  of  Current  Approaches .  43 

1.4  Thesis  Outline  .  44 

1.4.1  Approach .  44 

1.4.2  Document  Roadmap .  45 

2  Visually  Augmented  Navigation  47 

2.1  Introduction .  47 

2.2  Assumptions  .  49 

2.3  View-Based  SLAM  Estimation  Framework .  50 

2.3.1  Delayed-State  Extended  Kalman  Filter .  51 

2.3.2  Pairwise  Camera  Observation  Model .  53 

2.3.3  Link  Hypothesis .  55 

2.4  Generating  the  5-DOF  Camera  Measurement .  56 

2.4.1  Pairwise  Feature-Based  Image  Registration .  58 

2.4.2  Exploiting  Sensor-Measured  Absolute  Orientation  .  59 

2.4.3  Pose-Constrained  Correspondence  Search .  63 

2.4.4  Are  Pairwise  Camera  Measurements  Correlated? .  68 

2.5  Results .  70 

2.5.1  Real-World  Results:  Stellwagen  Bank .  70 


7 


2.5.2  Experimental  Validation:  JHU  Test  Tank .  75 

2.6  Chapter  Summary .  80 

3  Sparse  Extended  Information  Filters:  Insights  into  Sparsification  83 

3.1  Introduction .  83 

3.1.1  A  Survey  of  Scalable  Feature-Based  SLAM  Algorithms .  84 

3.1.2  The  Information  Form  and  a  New  Class  of  Algorithms .  85 

3.2  Background .  86 

3.2.1  The  Gaussian  Information  Form .  86 

3.2.2  Marginalization  and  Conditioning  .  87 

3.2.3  Controlling  Feature-Based  SLAM  Sparsity .  87 

3.2.4  Robot  Conditional  Independence  from  Passive  Features .  90 

3.3  Sparsification .  91 

3.3.1  SEIF  Sparsification  Rule .  91 

3.3.2  Modified  Sparsification  Rule .  93 

3.3.3  What  happens  if  we  just  leave  m_  alone? .  95 

3.4  Results .  96 

3.4.1  Simulation:  LG  SLAM .  97 

3.4.2  Experimental  Validation:  MIT  Tennis  Courts .  100 

3.5  Discussion .  102 

3.5.1  Imposing  Conditional  Independence  Leads  to  Inconsistency .  102 

3.6  Chapter  Summary .  104 

4  Exactly  Sparse  Delayed-State  Filters  105 

4.1  Background .  105 

4.2  Filter  Mechanics .  107 

4.2.1  State  Augmentation .  107 

4.2.2  Measurement  Updates .  109 

4.2.3  Motion  Prediction .  Ill 

4.2.4  State  Recovery .  112 

4.3  Consistent  Covariance  Recovery .  114 

4.3.1  Setting  the  Stage .  115 

4.3.2  Consistent  Covariances  for  Data  Association  .  116 

4.4  Discussion .  119 

4.4.1  Connection  to  Lu-Milios .  119 

4.4.2  Connection  to  Feature-Based  SLAM  .  119 

4.4.3  Video  Frame  Rates .  120 

4.5  Results .  121 

4.5.1  Experimental  Validation:  VAN  EKF  vs.  VAN  ESDF .  121 

4.5.2  Real-World  Results:  RMS  Titanic .  124 

4.6  Chapter  Summary .  133 

8 


5  Conclusions  137 

5.1  Contributions .  137 

5.2  Failure  Modes .  138 

5.3  Future  Work  .  139 

A  Robot  System,  Models,  and  Coordinate  Frames  145 

A.l  Platform .  145 

A. 2  6-DOF  Coordinate  Frame  Relationships .  147 

A. 2.1  Pose  Description .  147 

A. 2. 2  Head-to-Tail  Operation .  149 

A. 2.3  Inverse  Operation  .  150 

A. 2.4  Tail-to-Tail  Operation .  152 

A.3  Vehicle  Model .  153 

A. 3.1  6-DOF  Continuous-Time  Constant  Velocity  Process  Model .  153 

A. 3. 2  Continuous-Time  to  Discrete-Time  Mapping  .  155 

A. 4  Sensor  Observation  Models .  156 

A.4.1  DVL  Observation  Model .  156 

A.4.2  Angular  Rate  Sensor  Observation  Models .  158 

A. 4.3  Attitude  Sensor  Observation  Model .  158 

A.  4.4  Depth  Sensor  Observation  Model . 159 

B  Accompanying  Derivations  for  SEIFs  161 

B. l  Mechanics  of  the  Information  Form .  161 

B. 1.1  Marginalization .  162 

B.1.2  Conditioning .  163 

B.2  SEIF  Sparsification  Rule .  164 

B.2.1  Calculation  of  p/i(xt,  m°,  m+|m“  =  a) .  164 

B.2.2  Calculation  of  pe(xt,  m+|m~  =  a)  .  164 

B.2.3  Calculation  of  pc (m+ 1  m-  =  a) .  165 

B.2.4  Calculation  of  pp(m°,  m+, m_)  166 

B.2.5  Calculation  of  psEiF(xt,  m°,  m+,  m_) .  166 

B.3  Modified  Sparsification  Rule .  168 

B.3.1  Calculation  of  p[/(xf,  m+) .  169 

B.3.2  Calculation  of  pv/(m+) .  169 

B.3.3  Calculation  of  pModRuie(xt!  m°i  m+>  m~)  .  169 


9 


LIST  OF  FIGURES 


1-1  A  survey  of  state-of-the-art  vehicle  designs  for  deep-ocean  science .  20 

1-2  The  trajectories  of  a  multi-phase  deep-ocean  survey  by  the  ABE  AUV  ...  21 

1-3  Some  deep-water  applications  for  AUVs  .  22 

1-4  An  illustration  of  typical  swath  sensors  onboard  AUVs .  24 

1-5  Example  of  typical  feature  extraction  methods  on  underwater  imagery  ...  26 

1-6  The  optical  flow  ambiguity  between  small  rotations  and  small  translations  .  28 

1-7  Example  of  adaptive  histogram  specification  on  underwater  imagery  ....  28 

1-8  Fourier-based  image  registration  example .  30 

1-9  An  example  of  scene-induced  image  parallax .  31 

1-10  Epipolar  geometry .  33 

1-11  Epipolar  demonstration  for  a  pair  of  underwater  images .  33 

1-12  A  demonstration  of  underwater  backscatter .  34 

1-13  AUVs  must  carry  their  own  light  source .  36 

1-14  Backscatter  and  light  limitations  dictate  maximum  vehicle  altitude .  37 

1-15  Typical  LBL  cycle  for  ROV  navigation .  39 

1-16  A  DVL  is  shown  in-situ  on  the  SeaBED  AUV .  40 

1-17  A  mosaic-based  navigation  strategy .  42 

1- 18  Failure  of  a  MBN  strategy .  44 

2- 1  The  objective  of  VAN .  48 

2-2  System  diagram  for  a  view-based  representation .  50 

2-3  The  pairwise  5-DOF  camera  measurement .  55 

2-4  Calculation  of  pairwise  overlap  for  link  hypothesis  .  56 

2-5  Illustration  of  a  pinhole  camera  model .  58 

2-6  An  overview  of  the  pairwise  image  registration  engine .  60 

2-7  Typical  output  from  the  pairwise  feature-based  image  registration  module  .  61 

2-8  Synthetically  normalizing  the  camera  orientation  via  the  infinite  homography  63 

2-9  Pose-constrained  correspondence  regions  for  a  pair  of  underwater  images  .  .  67 

2-10  A  pose-constrained  candidate  correspondence  matrix .  68 

2-11  A  depiction  of  Stellwagen  Bank’s  topography .  71 

2-12  Time-evolution  of  the  Stellwagen  Bank  pose  network  uncertainty .  73 


2-13  Comparison  of  VAN  and  DR  recovered  trajectories .  74 

2-14  VAN  recovers  a  consistent  trajectory  despite  an  unmodeled  heading  bias  .  .  75 

2-15  The  JHU  experimental  setup  .  76 

2-16  The  dual-light  setup  used  on  the  JHU  ROV .  76 

2-17  JHU  results  comparing  DR  and  VAN  trajectories  to  ground-truth .  77 

2-18  Uncertainty  characteristics  of  VAN  versus  DR  for  the  JHU  tank  dataset  ...  79 

2- 19  An  EKF’s  processing  time  grows  quadratically  with  state  size .  81 

3- 1  Feature- based  SLAM  exhibits  a  “natural”  sparsity  in  the  information  form  .  85 

3-2  A  graphical  explanation  of  SEIFs  methodology  for  controlling  sparsity  ...  89 

3-3  An  illustration  of  SEIF’s  concept  of  active  and  passive  features .  91 

3-4  Leaving  m-  alone  in  the  conditioning  reactivates  passive  features .  96 

3-5  LG  SLAM  simulation:  NEES  for  the  vehicle  .  98 

3-6  LG  SLAM  simulation:  NEES  for  a  feature .  99 

3-7  LG  SLAM  simulation:  comparison  of  filter  covariances .  101 

3-8  The  MIT  hurdles  experiment  .  102 

3- 9  MIT  hurdles  experiment:  comparison  of  the  EKF  and  IF  SLAM  maps  ....  103 

4- 1  View-based  SLAM  is  exactly  sparse .  110 

4-2  ESDF  motion  prediction  is  constant-time .  112 

4-3  ESDF’s  connection  to  feature-based  SLAM .  120 

4-4  View- based  SLAM  with  video  frame  rates .  121 

4-5  Density  of  the  covariance  matrix  versus  sparsity  of  the  information  matrix  .  122 

4-6  Time  comparison  of  EKF  vs.  ESDF  filtering  operations .  123 

4-7  Institute  for  Exploration  ROV  Hercules  and  its  sensor  characteristics  ....  124 

4-8  Mapping  results  from  a  ROV  survey  of  the  RMS  Titanic .  125 

4-9  Time-evolution  of  the  RMS  Titanic  pose  constraint  network .  127 

4-10  The  recovered  XY  trajectory  and  covariance  bounds  for  the  RMS  Titanic  .  .  128 

4-11  The  triangulated  structure  for  the  RMS  Titanic  using  VAN’s  pose  estimates  129 
4-12  The  triangulated  point  cloud,  resulting  Delaunay  surface,  and  texture  mapped 

rendering  for  the  RMS  Titanic .  130 

4-13  A  histogram  of  VAN’s  estimated  precision  for  the  RMS  Titanic  pose-network  131 
4-14  A  quantitative  comparison  of  the  different  covariance  recovery  techniques  .  132 

4-15  Performance  of  the  different  covariance  recovery  techniques  within  the  con¬ 
text  of  image  registration .  134 

4- 16  The  same  demonstration  as  Fig.  4-15,  but  for  a  different  image  pair  ....  135 

5- 1  A  multi-image  joint-correspondence  search .  141 

5-2  Localizing  in  a  previously  mapped  area  fills  in  the  ESDF  information  matrix  143 

A-l  The  SeaBED  AUV .  146 

A-2  Illustration  of  the  relevant  navigation  frames .  148 

A-3  General  coordinate  frame  relations  for  three  arbitrary  frames  i,  j,  and  k  .  .  149 


12 


LIST  OF  TABLES 


1.1  Current  off-the-shelf  underwater  navigation  sensors  and  systems .  24 

2.1  Typical  pose  sensor  characteristics  for  underwater  platforms .  50 

3.1  Summary  of  the  marginalization  and  conditioning  operations  for  the  Gaus¬ 
sian  distribution  as  expressed  in  both  the  covariance  and  information  form  .  87 

A.l  Specifications  of  the  SeaBED  AUV  platform .  146 


LIST  OF  ALGORITHMS 


2.1  View- based  link  hypothesis .  57 

2.2  Pose-constrained  correspondence  search .  66 

4.1  Calculation  of  marginal  covariance  bounds  used  for  data  association  ....  119 

A.l  Piecewise-constant  continuous-time  to  discrete-time  mapping .  157 


13 


14 


LIST  OF  ACRONYMS 


ABE 

Autonomous  Benthic  Explorer 

AUV 

autonomous  underwater  vehicle 

BCC 

brightness  constancy  constraint 

CCD 

charge  coupled  device 

Cl 

covariance  intersection 

CG 

conjugate  gradients 

CT 

continuous-time 

DOF 

degree  of  freedom 

DR 

dead-reckoned 

DSL 

Deep  Submergence  Laboratory 

DT 

discrete-time 

DVL 

Doppler  velocity  log 

ESDF 

exactly  sparse  delayed-state  filter 

EM 

electromagnetic 

EKF 

extended  Kalman  filter 

EIF 

extended  information  filter 

FOG 

fiber  optic  gyro 

FOV 

field  of  view 

FFT 

fast  Fourier  transform 

GMRF 

Gaussian  Markov  random  field 

GPS 

global  positioning  system 

IF 

information  filter 
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CHAPTER  1 


Introduction 


1.1  Motivation 

/  %  UTONOMOUS  underwater  vehicles  (AUVs)  are  an  emerging  and  enabling  scientific  tech- 
/  %  nology  that  have  seen  significant  advances  and  growth  of  user  community  over  the 

past  decade.  A  brief  survey  of  recent  literature  shows  the  far-ranging  impact 
AUVs  have  had  in  the  research  community.  Applications  include  high-resolution  geolog¬ 
ical  mapping  [153, 174-176]  by  ABE  [13],  coral  reef  habitat  characterization  [145, 146]  by 
SeaBED  [147],  under- ice  ocean  exploration  [15]  by  Autosub  [59],  and  successful  survey  opera¬ 
tions  [165]  by  HUGIN  [62].  See  Fig.  1-1  for  a  depiction  of  the  different  vehicles.  The  growing 
popularity  of  AUVs  arises  from  their  unmanned  and  untethered  design  which  makes  them 
well  suited  to  extended  exploratory  surveys  requiring  minimal  user  intervention  and  sup¬ 
port.  Meanwhile,  their  autonomous  free-swimming  capability  has  added  a  new  paradigm  of 
ocean  sampling  to  the  scientific  user  community  as  demonstrated  by  Fig.  1-2.  They  com¬ 
plement  the  capabilities  of  tethered  remotely  operated  vehicles  (ROVs)  like  Jason-2  [35]  and 
free-swimming  manned  submersibles  like  Alvin  [31,129],  both  of  which  are  well  suited  to 
intensive  multi-sensor  imaging  and  sampling  of  relatively  small  work  areas. 

The  scientific  user  community  has  begun  to  embrace  and  exploit  AUVs  for  their  capacity 
to  perform  extended,  exploratory,  adaptive  ocean  sampling  and  mapping  surveys.  A  precis 
of  the  diverse  mission  scenarios  for  which  they  are  being  deployed  includes  hydrothermal 
vents  and  spreading  ridges  [153,174-176],  chemical  plume  mapping  [47,66],  studies  of  bio¬ 
diversity  [133],  underwater  forensics  [69,89,148],  deep-water  archeology  [5,144,178],  and 
the  monitoring  of  coral  reefs  [145].  Fig.  1-3  illustrates  a  few  of  these  applications.  For  max¬ 
imum  utility,  scientists  typically  require  that  AUVs  be  capable  of  georeferencing  both  the 
real-time  survey  and/or  the  collected  data  for  post-processing.  However,  depending  on  the 
requisite  spatial  precision  and  desired  survey  extent,  this  requirement  can  pose  a  significant 
challenge  due  to  the  lack  of  easily  obtainable  large-area  underwater  precision  navigation. 

While  the  underwater  realm  presents  its  own  peculiar  challenges  to  autonomous  nav¬ 
igation,  the  lack  of  easily  obtainable  precision  navigation  is  by  no  means  limited  to  the 
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Figure  1-1  A  survey  of  state-of-the-art  vehicle  designs  for  deep-ocean  science. 


sub-sea  domain  of  robotics.  In  fact,  the  past  decade  of  robotics  literature  shows  that,  in 
general,  a  fundamental  issue  of  current  interest  in  robotics  is  a  solution  to  the  joint  task  of 
simultaneous  localization  and  mapping  (SLAM)  —  a  capability  considered  to  be  a  key  pre¬ 
requisite  of  truly  autonomous  robots.  SLAM  represents  a  “chicken  and  egg”  problem  where 
the  concept  is  deceptively  simple.  The  robot’s  goal  is  to  be  able  to  autonomously  navigate 
through  an  a  priori  unknown  environment.  To  do  this,  it  tries  to  navigate  much  like  a 
person,  by  building  a  “mental  map”  of  distinguishable  landmarks  in  the  environment  that 
can  be  easily  recognized  when  revisited.  In  this  way,  whenever  the  robot  gets  “lost”  (i.e., 
accrues  a  lot  of  error  in  where  it  thinks  it  is),  if  it  can  sight  one  of  its  previously  identified 
landmarks  it  can  figure  out  where  it  is  with  respect  to  where  it  has  been.  The  difficulty 
here  is  that  the  robot  never  quite  knows  its  position  exactly  when  building  the  map  (due  to 
accumulation  of  small  errors  while  navigating)  which  is  further  complicated  by  the  fact  that 
its  perceptual  measurements  of  landmarks  are  never  perfect  (due  to  sensor  inaccuracies). 
The  net  effect  of  these  coupled  errors  is  that  the  very  map  that  the  robot  is  trying  to  use  to 
help  improve  navigation,  inherently  has  distortions  in  it  due  to  localization  errors  during  its 
construction,  which  then  affects  its  ability  to  navigate  —  hence,  the  simultaneous  nature  of 
the  problem. 

In  essence,  SLAM  involves  a  joint-estimation  problem  over  pose  and  map  and  has  at¬ 
tracted  a  flurry  of  research  over  the  past  decade  and  a  half  since  the  seminal  work  by 
Moutarlier  and  Chatila  [110]  and  Smith,  Self,  and  Cheeseman  [154,155].  Since  that  time, 
significant  advances  have  been  made  in  dealing  with  severed  fundamental  issues  such  as 
environmental  scalability  [11,60,85,86,108,160]  (i.e.,  how  many  landmarks  can  the  robot 
maintain  in  its  map),  data  association  [87,88,118]  (i.e.,  the  problem  of  establishing  land- 
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Figure  1-2  This  figure  shows  the  trajectories  of  a  multi-phase  deep-ocean  survey  by  the  ABE 
AUV  [82].  Plotted  on  top  of  the  corresponding  bathymetric  contour  plot  are  the  AUV  navigation 
tracklines  from  a  hydrothermal  vent  survey  of  the  Lau  Basin  in  the  South  Pacific.  The  different  color 
tracklines  represent  multiple  spatial  resolution  phases  of  the  survey.  As  highlighted  by  the  inset,  each 
successive  phase  is  conducted  at  a  finer  scale  as  ABE  “homes  in”  on  hydrothermal  vent  signatures. 
AUVs  are  especially  suited  for  this  task  because  they  are  both  autonomous  and  free-swimming  (i.e., 
unlike  ROVs  they  are  not  constrained  by  human  fatigue  and  ship  tether  restrictions). 
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Figure  courtesy  Michael  Jakuba  and  Dana  Yoerger. 
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Figure  1-3  Some  deep-water  applications  for  AUVs. 


(a)  Monitoring  the  health  of  deep-water  corals  [145].  (b)  Forensic  surveys  of  ship  or  plane  wrecks  [69]. 


(d)  Geological  surveys  of  spreading  mid-ocean  ridges  [174]. 


(c)  Deep-water  archeology  [5]. 
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mark  correspondence  to  measurements),  and  map  representation  [94, 109, 154]  (i.e.,  how  to 
model  the  environment  or  landmarks  within  it).  These  advances  have  lead  to  the  demon¬ 
stration  of  impressive  large-scale  autonomous  map  making  under  challenging  circumstances 
including  large  cyclic  environments  and  poor  odometry  [11, 160, 161],  ant)  represent  a  sig¬ 
nificant  fundamental  achievement  in  our  collective  understanding  of  navigation  with  mobile 
robotics. 

Thus,  a  SLAM  framework  seems  like  a  natural  choice  for  overcoming  the  current  naviga¬ 
tion  limitations  in  the  underwater  domain  so  that  we  can  better  support  near-seafloor  ocean 
science.  However,  in  trying  to  adopt  a  SLAM  methodology  for  AUV  navigation,  a  number 
of  constraints  quickly  come  to  the  fore.  First,  a  large  portion  of  the  prior  work  in  the  SLAM 
literature  has  relied  upon  high-bandwidth,  high-precision  laser  scanners  as  the  perceptual 
sensor  of  choice  for  constructing  accurate  maps.  Unfortunately,  the  strong  attenuation  of 
electromagnetic  (EM)  waves  in  the  underwater  realm  generally  limits  our  terrain  sensing 
abilities  to  either  an  acoustic  modality  (frequency-dependent  range  and  resolution)  or  near¬ 
field  vision  (1-5  m)  [149].  Secondly,  most  mapped  environments  in  the  SLAM  community 
are  man-made,  geometrically  simple,  indoor  office  spaces  where  a  2D  map  representation  is 
sufficient  and  landmark  features  abound.  However,  in  the  underwater  realm  science  drives 
the  requirement  that  we  must  be  able  to  navigate  in  3D,  rugged,  unstructured,  natural 
environments  exercising  full  6  degree  of  freedom  (DOF)  motion  [144, 145, 153, 174-176, 178]. 
Hence,  making  SLAM  a  viable  framework  for  improving  AUV  navigation  requires  general 
advances  to  overcome  the  particular  constraints  associated  with  a  marine  environment. 

While  the  issues  above  pose  significant  challenges  when  employing  SLAM  in  the  under¬ 
water  domain,  AUVs  themselves  also  offer  some  distinct  advantages.  For  one,  they  tend  to 
be  well  instrumented  with  advanced  suites  of  inertial  navigation  sensors.  This  sensor  suite 
may  include  a  Doppler  velocity  log  (DVL)  [170]  for  measuring  seafloor  referenced  velocities 
with  mm/s  precision  and/or  a  North-seeking  fiber  optic  gyro  (FOG)  [53]  as  a  sub-degree 
heading  reference  (Table  1.1).  Secondly,  AUVs  are  typically  used  for  studying  benthic  pro¬ 
cesses  (e.g.,  hydrothermal  vents  and  spreading  ridges  [174],  deep-ocean  corals  [146],  gas- 
blowout  structures  off  the  coastal  shelf  [66])  and  as  such,  they  routinely  collect  overlapping 
imagery  of  the  seafloor  using  high-dynamic  range  CCDs.  This  implies  that  we  can  expect 
near-bottom  AUVs  to  be  equipped  with  a  calibrated  camera  system  in  addition  to  other 
swath  sensors  [125]  —  see  Fig.  1-4  for  an  illustration.  Hence,  our  approach  for  overcoming 
near-seafloor  navigation  limitations  has  been  to  embrace  a  SLAM  framework  while  explic¬ 
itly  exploiting  the  available  sensor  suite  and  rich  calibrated  visual  imagery  that  is  routinely 
collected  during  benthic  underwater  surveys. 


1.2  A  Review  of  Underwater  Computer  Vision 

Computer  vision  is  a  broad  research  field  that  encompasses  a  diverse  range  of  theory  and 
application.  Here,  the  discussion  is  restricted  to  areas  that  are  essential  to  the  understanding 
of  underwater  real-time  visually- based  navigation  including  aspects  of  image  registration, 
mosaicking,  epipolar  geometry,  and  constraints  peculiar  to  the  marine  environment. 
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Table  1.1  Current  off-the-shelf  underwater  navigation  sensors  and  systems. 


INSTRUMENT 

VARIABLE 

INTERNAL? 

UPDATE  RATE 

PRECISION 

RANGE 

DRIFT 

Acoustic  Altimeter 

Z  -  Altitude 

yes 

varies:  0.1-10  Hz 

0.01-1.0  m 

varies 

— 

Pressure  Sensor 

Z  -  Depth 

yes 

medium:  1  Hz 

0.01% 

full-ocean 

— 

Inclinometer 

Roll  ic  Pitch 

yes 

fast:  1-10  Hz 

0.1-1° 

±45° 

— 

Magnetic  Compass 

Heading 

yes 

medium:  1-2  Hz 

1-10° 

360° 

— 

Gyro  Compass 

Heading 

yes 

fast:  1-10  Hz 

o 

o 

360° 

10°/h 

Ring-Laser  Gyro 

Heading 

yes 

fast:  1-1000  Hz 

0.0018° 

360° 

0.44°/h 

Bottom- Lock  Doppler 

XYZ  -  Velocity 

yes 

fast:  1-5  Hz 

0.2-1. 0% 

30-200  m 

— 

12  kHz  LBL 

XYZ  -  Position 

NO 

varies:  0. 1-1.0  Hz 

0.01-10  m 

5-10  km 

300  kHz  LBL 

XYZ  -  Position 

NO 

1. 0-5.0  Hz 

±0.002  m  typical 

100  m 

— 

Adapted  from  Whitcomb 


172]  and  Singh  [151]. 


Figure  1-4  An  illustration  of  the  different  types  of  swath  sensors  typically  available  on  large  AUVs. 


1.2.1  Image  Registration 

Image  registration  is  a  fundamental  task  in  computer  vision,  both  at  the  micro  level  (e.g., 
pairwise  registration)  and  the  macro  level  (e.g.,  photogrammetry  and  large-scale  bundle 
adjustment).  Its  objective  is  to  relate  two  or  more  views  of  the  same  scene  taken,  for 
example,  at  different  times,  from  different  modalities  (e.g.,  optical  and  infrared),  or  from 
different  viewpoints.  This  task  has  application  across  many  different  disciplines  spanning 
real-time  target  recognition  and  tracking  [162],  matching  image  pairs  for  recovering  camera 
ego- motion  and  scene  structure  [157],  aligning  images  from  different  modalities  in  medical 
diagnosis  [91],  and  quantifying  scene  change  detection  [81]. 

In  its  most  general  form,  image  registration  involves  determining  a  mapping  between 
images  both  spatially  and  with  respect  to  intensity  [17].  Defining  an  image  as  a  two- 
dimensional  array  /(u,  v)  where  the  spatial  indices  (tt,  v)  map  to  a  respective  intensity,  then 
the  mapping  between  images  I,  and  Ij  can  be  expressed  as 

Ij(u,v)  =  g(Ii(f(u,v))) 

Here,  g(.)  is  a  ID  intensity  or  radiometric  transformation  while  /(.)  is  a  2D  spatial  transfor¬ 
mation  mapping  coordinates  ( u ,  v)  to  new  coordinates  (u1,  v')  =  f(u,  v).  Generally  speaking, 
the  radiometric  mapping  g(.)  is  explicitly  considered  when  mapping  from  one  modality  to 
another  where  pixel  intensities  do  not  correspond  to  the  same  measurement  (e.g.,  registering 
optical  imagery  to  infrared  imagery  [72],  or  registering  video  imagery  to  a  3D-model  depth 
map  [166]).  It  is  also  relevant  in  imagery  where  the  scene  illumination  varies  and  correspond¬ 
ing  image  points  may  not  have  the  same  intensity.  For  example,  in  deep-sea  underwater 
imagery  vehicles  must  carry  their  own  illumination,  resulting  in  varying  scene  brightness. 
Without  loss  of  generality  one  can  drop  the  explicit  modeling  of  the  radiometric  mapping 
g(.)  and  instead  focus  solely  on  the  spatial  registration  /(.)  (i.e.,  Ij(u,v )  =  Ii(f(u,v)))  not¬ 
ing  that  the  radiometric  mapping  may  either  be  considered  as  a  preprocessing  step  [149], 
or  incorporated  into  the  spatial  registration  technique  directly  [72] . 

The  spatial  registration  of  images  by  the  mapping  f(u,v)  is  generally  based  upon  a 
motion-model  [9, 80] .  Two-dimensional  global  parametric  motion-models  are  a  useful  class 
of  mappings  which  can  be  applied  across  the  whole  image  and  are  often  used  in  mosaicking. 
These  global  transformations  define  a  displacement  (Au,  Av)  for  every  pixel  (u,  v)  in  the 
image  and  range  across  increasing  complexity  from  rigid,  affine,  projective,  perspective, 
and  polynomial  transforms  [17].  The  different  techniques  used  to  determine  the  motion- 
model  parameters  can  generally  be  divided  into  two  classes  —  indirect  methods  and  direct 
methods  [17]. 

Indirect  Methods 

Indirect  or  feature-based  methods  generally  rely  upon  condensing  the  large  amount  of  image 
information  into  a  small  subset  of  feature  tokens  thereby  reducing  the  amount  of  correspond¬ 
ing  data  to  be  matched.  The  first  step  in  determining  features  is  to  apply  some  form  of  fea¬ 
ture  extraction  to  the  image.  This  operation  is  generally  desired  to  be  invariant  to  a  certain 
degree  of  image  distortions,  such  as  rotation  and/or  scaling,  so  that  the  same  interest  point 
may  be  picked  out  in  both  the  reference  and  input  images  [140].  Operators  include  edges 
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Figure  1-5  This  example  illustrates  some  of  the  different  types  of  features  that  can  be  extracted 
from  underwater  imagery  using  standard  feature  detectors. 


(a)  Original  image.  (b)  Canny  edge  detector. 


and  contours  [22,90],  corners  [63],  extremal  regions  [100],  and  scale-space  maxima  [93,106] 
as  such  operators  extract  the  intrinsic  local  structure  of  an  image.  Examples  of  well-known 
operators,  as  illustrated  in  Fig.  1-5,  include  the  Canny  edge  detector  [22],  the  Harris  corner 
detector  [63],  and  most  recently,  the  scale  invariant  feature  transform  (SIFT)  [93].  The  main 
goal  of  these  operators  is  to  pick  out  features  of  local  interest  in  the  image  that  contain 
information  indicating  the  presence  of  easily  distinguishable  and  meaningful  characteristics 
in  the  scene. 

Once  features  have  been  extracted,  the  next  step  is  to  establish  their  putative  correspon¬ 
dence  across  overlapping  imagery.  Traditional  methods  establish  feature  correspondence  by 
optimizing  some  type  of  local  similarity  metric  such  as  correlation  or  equivalently  sum  of 
squared  differences  (SSD)  [55].  While  2D  correlation  is  computationally  cheap  as  a  sim¬ 
ilarity  metric,  it  fails  if  feature  regions  differ  by  moderate  rotations  or  scale  differences. 
To  overcome  these  limitations,  more  advanced  techniques  rely  upon  encoding  some  form 
of  locally  invariant  feature  descriptor.  Differential  invariants  such  as  those  described  by 
Schmid  [140],  generalized  image  moments  based  upon  Zernike  polynomials  [3,77,126],  and 
Lowe’s  scale  invariant  feature  transform  [93]  are  all  robust  to  scale  and  rotation.  For  more 
challenging  registration  problems,  though,  such  as  wide-baseline  stereo  applications,  affine 
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(d)  SIFT  features. 


invariant  feature  descriptors  have  proven  to  be  the  most  robust  for  establishing  correspon¬ 
dences  [106, 139, 164]  (although  they  are  also  the  most  expensive  computationally). 

The  last  step  in  the  registration  process  is  to  fit  a  parametric  motion-model  that  de¬ 
scribes  the  feature  correspondences.  Since  the  putative  correspondence  stage  is  often  prone 
to  error,  robust  outlier  detection  methods,  such  as  least  median  of  squares  (LMedS)  [134] 
or  random  sample  consensus  (RANSAC)  [42],  are  typically  employed  in  an  iterative  fashion 
to  find  a  consistent  inlier  set  and  initial  fit  for  the  motion-model.  Having  done  this,  a 
maximum  likelihood  estimate  (MLE)  [64]  is  then  typically  performed  as  the  final  step. 

Direct  Methods 

In  contrast  to  feature-based  indirect  methods,  direct  methods  work  directly  on  the  entire 
image  to  estimate  the  motion-model  by  computing  measurable  quantities  from  the  raw  pixel 
(intensity  or  color)  values.  A  standard  and  well  known  technique  has  been  the  application  of 
the  brightness  constancy  constraint  (BCC)  [67],  which  assumes  that  an  image  point  or  small 
region  corresponding  to  a  particular  scene  point  or  surface  patch  remains  approximately 
constant  during  the  motion  of  the  camera  relative  to  the  scene.  This  model  is  exact  for 
Lambertian  surfaces  that  are  stationary  with  respect  to  the  illumination  source  in  the 
presence  of  a  moving  camera.  The  technique  relies  upon  measuring  the  spatio-temporal 
image  gradient  (commonly  referred  to  as  optical  flow)  to  estimate  the  image  motion  and, 
thus,  generally  requires  video  frame  rates  to  satisfy  the  BCC’s  differential  assumptions  of 
spatial  and  temporal  smoothness  [112].  One  of  the  weak  points  of  this  method  for  visual 
navigation  is  that  it  is  susceptible  to  motion  drift  when  integrated  over  time.  This  is 
particularly  true  for  narrow  field  of  view  (FOV)  cameras  where  the  optical  flow  field  is 
ambiguous  for  small  translations  parallel  to  the  image  plane  versus  small  rotations  along 
the  pan  and  tilt  axes  (Fig.  1-6).  While  this  method  has  been  applied  with  good  success  in 
terrestrial  applications  where  lighting  is  typically  more  uniform,  it  performs  poorly  when 
naively  applied  to  underwater  imagery  due  to  a  severe  violation  of  the  assumption  that 
illumination  be  stationary  with  respect  to  the  scene.  Illumination  of  deep-sea  imagery  is 
necessarily  time-varying,  as  vehicles  have  to  carry  their  own  light  source.  This  results  in 
varying  scene  irradiance  across  images,  and  in  moving  shadows.  Negahdaripour  [113,117] 
has  attempted  to  salvage  the  BCC,  though,  by  incorporating  a  model  for  affine  varying 
scene  irradiance. 

Other  researchers  have  approached  the  problem  of  time-varying  illumination  by  first 
transforming  the  images  into  an  illumination- invariant  representation  [72,149].  In  [72]  local 
normalized  correlation  is  used  in  a  pyramidal  approach  of  maximizing  normalized  corre¬ 
lation  surfaces  to  estimate  the  motion-model  where  the  multi-resolution  implementation 
allows  for  larger  inter-image  displacements  [9].  This  technique  was  originally  developed  for 
multi-sensor  fusion  of  optical  and  infrared  imagery  and  has  been  successfully  applied  un¬ 
derwater  in  at  least  one  known  structure- from-motion  (SFM)  application  [97].  Meanwhile, 
in  [149]  a  technique  for  underwater  imagery  is  developed  by  first  preprocessing  using  adap¬ 
tive  histogram  specification  [37,182]  (Fig.  1-7)  followed  by  gray-level  thresholding  to  detect 
and  discount  shadow  regions  in  an  attempt  to  account  for  lighting  variations  before  attempt¬ 
ing  image  registration.  The  effect  of  this  preprocessing  step  is  the  masking  of  shadows  and 
equalization  of  image  contrast,  which  reduces  the  effects  of  lighting  patterns. 

27 


Figure  1-6  Optical  flow  methods  for  narrow  FOV  cameras  suffer  from  a  visual  ambiguity  between 
small  camera  translations  versus  small  rotations.  This  simulated  optical  flow  field  was  calculated 
for  a  small  camera  movement  over  a  planar  scene  oriented  parallel  to  the  image  plane.  The  two 
fields  look  nearly  identical  towards  the  center  of  the  image  (indicated  by  the  black  square).  Note 
that  most  direct  methods  only  process  the  central  region  of  the  image,  both  for  efficiency  and  to 
minimize  radial  distortion  effects. 


(a)  Small  translation  to  the  left. 


left. 


Figure  1-7  Varying  scene  illumination  adds  an  additional  challenge  to  underwater  image  registra¬ 
tion.  In  this  example  the  original  image  is  preprocessed  using  contrast- limited  adaptive  histogram 
specification  to  compensate  for  vehicle  lighting  patterns  before  attempting  to  register  imagery. 


(a)  Raw  image  collected  by  ABE  on  a  geological  survey  (b)  Equalized  imagery  using  the  algorithm  in  [37]. 
of  a  spreading  mid-ocean  ridge. 
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Correlation  and  its  variants  also  fall  under  the  umbrella  of  direct  methods  and  are  based 
upon  finding  the  extrema  of  some  form  of  signal  similarity  measure.  From  a  computational 
standpoint,  these  techniques  can  be  used  when  the  images  appear  to  be  mostly  displaced 
and,  thus,  have  undergone  little  rotation  or  distortion  so  that  a  2D  translational  search 
space  is  sufficient.  Unfortunately,  image  distortions  such  as  rotation  and  scaling  are  com¬ 
mon  in  underwater  imagery,  as  vehicle  surveys  are  often  unstructured  or  navigated  with 
low  resolution  navigation  [127].  In  order  to  handle  these  higher-dimensional  search  spaces, 
correlation-based  techniques  can  be  posed  more  efficiently  in  the  frequency  domain  by  ex¬ 
ploiting  the  phase  shifting  property  of  the  Fourier  transform  to  handle  large  rotations  and 
scale  changes. 

The  phase  shifting  property  of  the  Fourier  transform  states  that  signals  that  are  spatially 
shifted  will  result  in  transforms  that  are  shifted  in  phase: 

hj(u,  v)  =  hi(u  —  A u,  v  —  Av)  spatial  domain 

frequency  domain 

This  property  can  easily  be  exploited  to  recover  the  unknown  translation  ( A?x,  Av)  for  2D 
images  and  can  also  be  extended  to  recover  scale  and  rotation  by  representing  these  pa¬ 
rameters  in  a  coordinate  system  where  they  appear  as  shifts  as  described  in  [37, 127, 132]. 
Even  more  general  affine  motion-models  can  be  represented  by  making  use  of  additional 
Fourier  properties  [79,95].  These  techniques  benefit  computationally  by  making  use  of  the 
fast  Fourier  transform  (FFT)  and  are  insensitive  to  isolated  frequency  dependent  image 
noise  such  as  low-frequency  illumination  differences.  Drawbacks,  however,  are:  1)  they 
require  a  large  overlap  between  image  pairs  to  accommodate  common  area  frequency  rep¬ 
resentations,  and  2)  only  linear  motion-models  can  be  described  using  Fourier  transform 
properties.  Fig.  1-8  demonstrates  applying  the  frequency  domain  technique  to  register  a 
pair  of  underwater  images. 

1.2.2  Mosaicking 

Mosaicking  is  the  task  of  combining  two  or  more  images  such  that  the  resulting  composite 
image  has  an  increased  effective  FOV.  The  problem  has  been  extensively  studied  [73, 80, 
123, 136, 157],  with  early  roots  in  aerial  and  satellite  imaging  where  the  planar  parametric 
motion-model  is  well  approximated  due  to  the  large  separation  between  camera  and  scene. 
Planar  parametric  motion-models  yield  a  composite  image  that  is  theoretically  exact  under 
only  two  conditions:  1)  the  scene  structure  is  arbitrary  and  the  camera  undergoes  rotation 
about  its  optical  center,  or  2)  the  camera  motion  is  arbitrary,  but  the  scene  being  viewed 
is  planar  [76].  Both  of  these  conditions  are  equivalent  to  no  observed  parallax  in  the  input 
images  (Fig.  1-9). 

Temporal  Mosaicking 

Early  methods  in  mosaicking  by  the  computer  graphics  community  approached  the  problem 
in  a  temporally  causal  manner  [73,80, 123, 136, 157].  These  approaches  processed  the  im¬ 
agery  in  a  sequential  manner  to  determine  the  pairwise  homographies  relating  the  temporal 
sequence,  and  constructed  a  composite  view  by  concatenation  (thus,  warping  all  images  to 
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Figure  1-8  This  figure  demonstrates  using  the  2D  Fourier  Transform  technique  to  register  a  pair 
of  underwater  images  collected  during  a  forensic  survey  of  the  wreck  of  the  M/V  Derbyshire  [69]. 


(a)  Raw  underwater  control  image.  (b)  Raw  underwater  input  image. 


(c)  Input  image  registered  into  the  coordinate  frame  of  the  control 
image  via  Fourier  methods  [37]  (average  intensity  is  shown). 
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Figure  1-9  An  example  of  scene-induced  image  parallax.  Images  (a)  and  (b),  denoted  II  and  In, 
correspond  to  two  photos  of  a  wall  in  front  of  the  Cashier’s  Office  at  MIT  taken  from  left  and  right 
vantage  points  respectively.  Image  (d),  denoted  Iw,  is  the  result  of  warping  IR  onto  IL  using  a 
planar  perspective  homography  as  illustrated  in  (c).  Note  that  the  area  common  to  both  (a)  and 
(d)  is  in  agreement  except  for  the  door  jambs,  which  violate  the  planar  scene  assumption.  Image  (e) 
shows  the  pointwise  difference  between  I l  and  Iw  ■  The  discrepancy  is  due  to  parallax. 


(a)  Left  view.  (b)  Right  view. 


(c)  Planar  homography  model. 


(d)  Warp  of  right  to  left.  (e)  Their  difference. 
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a  common  reference  frame).  While  the  pairwise  homographies  accurately  describe  the  local 
registration,  the  small  residual  local  alignment  errors,  coupled  with  errors  in  the  applied 
motion-model,  lead  to  an  amplified  global  error  when  simply  concatenated  over  long  se¬ 
quences.  Since  the  image  to  reference  frame  homographies  calculated  by  compounding  do 
not  attempt  to  achieve  global  consistency,  images  that  are  not  temporal  neighbors,  but  are 
spatial  neighbors,  may  not  be  co-registered  in  the  resulting  mosaic. 

Global  Mosaicking 

More  recent  efforts  have  focused  on  imposing  the  available  non-temporal  spatial  constraints 
to  produce  a  globally  consistent  mosaic  (21, 126, 137, 138].  These  methods  formulate  the 
problem  as  the  optimization  of  a  global  cost  function  parameterized  by  all  of  the  image  to 
mosaic  frame  homography  parameters.  The  mosaic  topology  may  initially  be  derived  in  a 
coarse  manner  assuming  simplified  motion-model  parameters  between  temporally  connected 
neighbors.  From  this  roughly  estimated  topology,  new  spatial  neighbors  are  hypothesized 
and  then  tested.  This  process  is  iterated  until  a  stable  image  topology  emerges.  The 
optimization  of  the  cost  function  incorporates  these  spatial  constraints  to  produce  a  globally 
consistent  mosaic  with  enhanced  quality  and  robustness  as  compared  to  simpler  mosaicking 
methods. 

1.2.3  Multiview  Geometry 
The  Epipolar  Constraint 

While  homographies  can  often  be  a  useful  approximation  for  obtaining  composite  views  over 
an  expanded  FOV  or  for  planar  visual  navigation  [41],  their  shortcoming  is  that  they  can 
model  only  views  of  a  single-plane  environment.  When  the  scene  is  non  planar,  more  general 
descriptions  of  image  motion  must  be  employed.  For  example,  the  epipolar  geometry  for  a 
pair  of  views  is  defined  by  the  relative  camera  pose  and  allows  for  any  ray  from  one  image 
to  be  projected  into  the  view  of  the  other  as  demonstrated  by  Fig.  1-10  and  Fig.  1-11.  For 
a  calibrated  camera  this  geometric  relationship  is  mathematically  encoded  in  the  Essential 
matrix,  which  is  a  3  x  3  matrix  that  maps  homogeneous  normal  coordinates  from  one 
image  into  the  corresponding  homogeneous  epipolar  line  in  the  other.  The  epipolar  line 
encodes  for  all  possible  scene  depths  the  projection  of  a  scene  point  into  the  view  of  the 
other  camera.  It  can  be  used  as  an  efficient  ID  search  constraint  when  trying  to  establish 
correspondences  using  a  stereo  camera  setup  where  the  relative  camera  pose  is  a  known 
fixed  quantity  [29].  In  the  case  of  unknown  calibration  this  relationship  can  be  extended 
more  generally  through  the  Fundamental  matrix  [64, 179],  which  extends  the  concept  of  the 
Essential  matrix  by  incorporating  the  unknown  camera  calibration  into  its  definition  and 
in  recent  years  has  become  the  focus  of  research  due  to  the  growing  popularity  of  variable 
focus  consumer  digital-still  and  video  cameras. 

Structure  from  Motion 

The  most  popular  application  of  multiview  geometry  has  been  in  that  of  offline  structure- 
from- motion  (SFM)  [43,96,124].  SFM  relates  multiple  views  through  either  the  Essential 
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Figure  1-10  Epipolar  geometry  model.  The  epipolar  geometry  is  based  upon  ray  projections 
between  adjacent  views  and  holds  regardless  of  scene  structure.  In  this  example,  the  two  scene 
points  Xi  and  X2  are  projected  into  scene  points  Xi  and  x2  and  x^  and  x!2  in  the  left  and  right 
image  planes  using  camera  projection  matrices  P  and  P'  respectively.  Note  that  each  scene  point  in 
conjunction  with  the  camera  centers  defines  a  plane,  denoted  the  epipolar  plane.  The  intersection  of 
each  epipolar  plane  with  the  image  plane  defines  the  epipolar  line ;  it  represents  the  projection  of  the 
corresponding  scene  ray  as  viewed  by  the  other  camera.  When  the  geometry  between  the  cameras 
(i.e.,  R  and  t)  is  known,  then  the  epipolar  lines  provide  a  ID  search  constraint  for  establishing 
correspondences.  Finally,  note  that  the  set  of  all  epipolar  planes  defines  a  pencil  whose  intersection 
with  the  image  planes  defines  the  epipoles  e  and  e'.  The  line  connecting  the  epipoles  is  called  the 
baseline  and  corresponds  to  the  vector  t. 

x; 


Figure  1-11  This  figure  demonstrates  the  epipolar  geometry  for  a  pair  of  successfully  registered 
underwater  images  of  the  RMS  Titanic.  The  epipolar  lines  are  overlaid  on  the  imagery  and  are  color 
coded  for  correspondence  (the  circles  along  each  line  are  the  matched  feature  points).  Note  that  the 
lines  converge  at  the  epipoles,  which  in  this  case  are  located  outside  of  the  viewable  image. 
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Figure  1-12  A  demonstration  of  underwater  backscatter  using  data  collected  by  the  Jason  ROV 
[149].  In  this  example,  a  sequence  of  images  is  shown  over  an  incremental  range  of  altitudes  to 
demonstrate  the  significance  of  backscatter  in  the  underwater  imaging  process.  Note  that  backscatter 
reduces  the  effective  altitude  at  which  an  underwater  vehicle  can  clearly  image  the  scene.  Practical 
camera-to-light  separations  for  a  typical  AUV  platform  dictate  that  it  must  fly  within  several  meters 
of  the  seafloor  in  order  to  find  a  tolerable  tradeoff  between  backscatter  and  imaged  FOV. 


(e)  9.5  m  (f)  11.0  m  (g)  12.5  m 
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matrix  or  the  Fundamental  matrix  and  its  goal  is  to  recover  both  camera  motion  and 
scene  structure  [163].  In  recent  years,  a  fundamental  research  task  has  been  the  discovery 
of  efficient  and  robust  online  SFM  algorithms  that  are  scalable.  Early  online  methods 
approached  the  problem  in  a  temporally  causal  manner  similar  to  early  mosaicking,  and 
hence  suffered  from  motion  and  scene  drift  over  long  sequences  [8,104,181].  Their  associated 
drift  resulted  from  the  simple  open-loop  construction  of  motion  and  scene  which  failed  to  use 
information  about  revisiting  previously  identified  structure.  Recently,  though,  researchers 
such  as  Davison  [27]  and  Bosse  [10]  have  begun  to  frame  online  SFM  in  a  SLAM  context, 
thereby  benefiting  from  the  explicit  representation  of  joint  motion/structure  estimation. 
In  particular,  Davison  has  recently  shown  impressive  real-time  SFM  results  for  a  wearable 
video  camera  [28],  though  on  a  small  spatial  scale. 

1.2.4  Imaging  Constraints  of  the  Underwater  Environment 
Scattering 

The  underwater  environment  places  unique  constraints  on  the  ability  to  utilize  visual  in¬ 
formation.  The  absorption  and  scattering  of  light  through  the  medium  of  water  was  first 
understood  in  a  physics- based  context  with  the  pioneering  work  by  Duntley  [34].  Duntley 
showed  that  the  propagation  of  light  underwater  suffers  from  a  wavelength-dependent  expo¬ 
nential  attenuation.  In  more  recent  years,  McGlamery  [102]  investigated  the  fundamentals 
of  the  image  formation  process  by  computer  modeling  the  absorption  coupled  with  the  di¬ 
rect,  forward,  and  backscatter  light  components.  Jaffe  [74]  later  extended  McGlamery’s 
work  to  determine  the  idealized  vehicle  lighting  configuration  for  minimal  backscatter  and 
good  scene  illumination.  His  results  for  standard  lighting  configurations  confirmed  that  large 
horizontal  camera-to-light  separations  were  desirable  to  reduce  backscatter  —  the  principle 
cause  being  the  reduction  of  common  volume  between  the  camera  FOV  and  volume  of  pro¬ 
jected  light.  However,  Singh  [149]  recently  showed  that  there  are  theoretical  limits  to  the 
benefits  of  large  camera-to-light  separation  as  applied  to  practical  vehicle  configurations. 
Fig.  1-12  demonstrates  the  range  over  which  backscatter  has  an  effect  for  a  fixed  camera 
and  light  geometry. 

Attenuation 

In  conjunction  with  the  constraint  of  minimizing  backscatter,  the  rapid  attenuation  of 
light  through  water  imposes  additional  challenges  when  collecting  underwater  imagery 
(Fig.  1-13).  Light  attenuation  limits  the  altitude  at  which  a  vehicle  can  fly  from  the  seafloor 
and  collect  imagery.  As  deep-sea  vehicles  are  required  to  carry  their  own  light  sources,  this 
constraint  has  implications  in  both  minimizing  terrain  parallax  effects  and  in  generating 
large-area  imagery  since  the  constraining  altitude  is  typically  3-10  m  [149].  In  addition,  the 
light  source  moves  with  the  vehicle,  leading  to  nonuniform  illumination  and  moving  shadows 
—  both  of  which  pose  additional  challenges  during  image  registration.  Vehicles  are  forced 
to  fly  close  to  the  seafloor  where  terrain  relief  may  be  comparable  to  the  camera  to  seafloor 
separation,  inducing  gross  perspective  changes  (Fig.  1-14).  Also,  each  image  encompasses 
a  small  area  of  the  seafloor,  reducing  the  overall  FOV.  For  mosaicking,  this  implies  that 
many  images  must  be  registered  to  increase  the  effective  FOV,  and  that  terrain  distortion 
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Figure  1-13  The  consequences  of  AUVs  having  to  carry  their  own  light  source,  (a)  Strobed  imagery 
from  energy  constrained  AUVs  often  tends  to  be  light  limited  causing  the  images  to  appear  dark  and 
decrease  in  contrast  towards  the  edges  as  demonstrated  by  the  ABE  imagery  on  the  left.  In  addition, 
the  preferential  absorption  of  red  light  causes  color  images  to  appear  green  as  shown  by  the  SeaBED 
imagery  on  the  right,  (b)  Illumination  from  different  points  can  have  a  pronounced  effect  on  scene 
appearance.  This  cross-track  image  pair  was  collected  by  the  SeaBED  AUV  off  of  Stellwagen  Bank 
National  Marine  Sanctuary  and  illustrates  the  effect  of  illumination  from  reciprocal  headings.  To 
aid  in  interpretation  the  rightmost  image  has  been  pre-rotated  180°to  offset  the  nominal  heading 
difference.  Also,  to  improve  visual  queues  both  images  have  been  color-corrected  as  described  by  [20] 
and  manual  correspondences  have  been  overlaid.  As  an  aside,  note  that  the  image  on  the  right 
corresponds  to  the  raw  uncorrected  color  image  shown  above  it  in  (a). 


(a)  Light  limited  images. 


(b)  Scene  appearance  varies  markedly  with  location  of  light  source. 
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Figure  1-14  Backscatter  and  light  limitations  dictate  that  the  vehicle  must  fly  close  to  the  seafloor 
when  collecting  imagery  [149].  This  reduces  the  effective  FOV  for  each  image,  requiring  many  images 
to  cover  a  given  area.  Low- altitude/close  flying  also  makes  3D  terrain  effects  pronounced. 


effects  become  more  significant. 

Registration 

Image  registration  can  also  be  more  difficult  with  underwater  imagery  than  with  land- 
based  acquired  imagery.  Unstructured  surveys  by  vehicles  with  low- resolution  navigation 
and  heading  inaccuracies  are  common.  This  results  in  imagery  with  gross  motions  between 
temporal  frames,  often  with  minimum  overlap  due  to  strobed  lighting  [12].  In  addition,  the 
types  of  imaged  scenery  can  be  vastly  different  ranging  from  highly  3D  coral  reefs  [147]  to 
featureless  muddy  bottoms  [148].  Man-made  features  such  as  edges,  corners,  and  parallel 
lines,  prominent  in  land-based  images,  cannot  be  reliably  expected  to  occur  in  underwater 
imagery. 

Power 

Power  budget  limitations  of  AUVs  are  also  an  important  consideration  in  the  design  of 
imaging  systems.  The  amount  of  energy  expended  in  illuminating  the  scene  has  a  direct 
negative  effect  on  the  endurance  of  these  battery-limited  vehicles  [12].  Typically,  AUVs 
cannot  afford  to  put  out  the  continuous  lighting  needed  for  video  frame  rates  because  it 
would  come  at  the  sacrifice  of  precious  bottom-time.  Rather,  strobed  lighting  is  often  used 
to  conserve  power  [147, 150].  Additionally,  the  low  amount  of  image  overlap  afforded  by 
this  illumination  scheme  precludes  optical-flow  image  registration  methods  such  as  [114,116]. 
Hence,  the  unique  energy  constraints  of  AUVs  are  a  major  driver  for  the  goal  of  this  thesis 
to  be  able  to  handle  low  overlap  imagery  (i.e.,  15-35%  temporal  overlap). 
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1.3  A  Review  of  Underwater  Vehicle  Navigation 


The  land-based  community  is  able  to  obtain  meter-level  position  accuracy  almost  anywhere 
in  the  world  above  ground  via  the  global  positioning  system  (GPS).  However,  the  attenua¬ 
tion  of  electromagnetic  waves  through  the  medium  of  water  limits  the  application  of  GPS 
to  near  surface  activities.  This  section  reviews  the  available  and  relevant  techniques  of 
underwater  vehicle  navigation  to  establish  the  current  state-of-the-art. 

1.3.1  Long  Baseline  Navigation 

The  standard  for  bounded  xyz  navigational  position  measurements  for  underwater  vehicles 
is  the  long-baseline  (LBL)  acoustic  transponder  system.  LBL  was  originally  developed  at 
the  Woods  Hole  Oceanographic  Institution  (WHOI)  in  the  early  1970’s  and  has  since  then 
become  an  integral  part  of  the  marine  science  community  [71].  LBL  requires  two  or  more 
acoustic  transponder  beacons  to  be  tethered  to  the  seafloor  and  operates  on  the  principle 
of  time-of-flight.  Given  knowledge  of  the  speed  of  sound  in  water  (which  is  of  the  order 
of  1500  m/s),  the  round-trip  travel  time  of  an  acoustic  signal  propagating  between  two 
unobstructed  points  becomes  a  proxy  for  the  line  of  sight  distance  between  them. 

LBL  Setup 

The  first  task  in  setting  up  a  LBL  network  is  to  deploy  the  acoustic  beacons  at  the  site  of 
interest.  This  requires  deploying  the  bottom- tethered  acoustic  transponders  in  a  configu¬ 
ration  that  optimizes  both  the  acoustic  signal  propagation  and  the  geometry  of  the  vehicle 
work  site  [71].  Next  a  sound  velocity  profile  measurement  is  collected  from  the  surface 
ship.  This  profile  measures  the  sound  speed  throughout  the  water  column  and  is  used  in 
all  time-of-flight  calculations  to  compensate  for  vehicle  depth.  Finally,  each  transponder 
in  the  network  is  surveyed  and  placed  in  a  world  frame  of  reference  by  the  surface  ship. 
This  involves  the  surface  ship  acoustically  interrogating  the  transponders  and  recording 
time-of-flight  measurements  to  individual  transponders  while  concurrently  recording  GPS 
position  measurements  as  it  steers  a  survey  pattern  from  the  surface.  Both  the  recorded 
ship  positions  and  transponder  round-trip  travel  times  are  then  processed  to  compute  a 
least-squares  world- referenced  XYZ  position  for  each  transponder  [169]. 

LBL  Navigation  and  its  Characteristics 

A  typical  LBL  configuration  for  AUVs  is  to  have  the  AUV  act  as  the  master  with  the 
transponders  set  as  slaves.  The  master  transmits  on  one  frequency,  say  10  kHz,  and  upon 
receiving  this  signal  each  slave  replies  on  a  unique  frequency,  say  11  kHz  and  12  kHz.  Similar 
to  the  ROV  cycle  shown  in  Fig.  1-15,  this  allows  the  master  to  record  the  two-way  travel 
time  between  it  and  each  slave.  A  typical  LBL  system  will  operate  up  to  a  range  of  10  km 
providing  a  bounded  error  XYZ  position  estimate  with  a  range-dependent  error  measurement 
of  the  order  of  0.1-10  m  [171].  Higher  frequency  LBL  systems  operating  at  300  kHz  exist 
and  are  capable  of  higher  precision  position  bounds  [177].  Although  these  systems  are 
capable  of  sub-centimeter  position  resolution,  they  have  a  maximum  working  range  of  only 
~100  m  as  compared  to  the  typical  12  kHz  LBL  systems  which  have  a  working  range  of 


38 


Figure  1-15  Typical  LBL  cycle  for  a  ROV  deployment  (figure  adapted  from  WHOI-74-6  [71]). 


(a)  Ship  interrogates  ROV  at  9  kHz.  (b)  ROV  receives  interrogation  at  (c)  The  tethered  transponders  re- 

9  kHz  and  replies  at  10  kHz.  ceive  interrogation  at  10  kHz 

and  reply  at  11  and  12  kHz. 


5-10  km  (172).  Milne  [107]  provides  a  more  detailed  discussion  of  the  implementation  and 
working  principles  behind  LBL  and  other  acoustic  positioning  systems. 

1.3.2  Doppler  Velocity  Logs 

Recent  advances  have  been  made  in  the  area  of  dead-reckoned  (DR)  vehicle  navigation  with 
the  advent  of  the  bottom-lock  Doppler  velocity  log  (DVL).  The  DVL  provides  a  measurement 
of  seafloor-referenced  vehicle  velocity,  which  can  be  integrated  over  time  to  provide  XYZ 
positional  information.  The  basic  working  principle  behind  these  bottom-referenced  velocity 
measurements  is  the  acoustic  Doppler  effect,  which  states  that  a  change  in  the  observed 
sound  pitch  results  from  relative  motion.  This  change  in  sound  pitch  is  directly  proportional 
to  the  relative  radial  velocity  between  the  source  and  receiver  and  can  be  used  to  recover 
seafloor-referenced  vehicle  velocity.  Additionally,  a  DVL  can  also  be  used  to  measure  water- 
referenced  velocities. 

DVL  Technology 

Commercially  available  broadband  DVLs  (Fig.  1-16),  as  opposed  to  traditional  continuous- 
tone  DVLs,  make  use  of  time  dilation  to  compute  a  velocity  measurement  from  an  ensemble 
of  “discrete”  pings.  The  use  of  time  dilation  results  in  a  more  accurate  measurement  of 
the  Doppler  shift  with  single  ping  velocity  error  standard  deviations  less  than  1%  [170]. 
When  n-ping  ensemble  averaging  is  performed,  the  standard  deviation  further  decreases  as 
[130].  Most  off-the-shelf  DVLs  use  a  Janus  transducer  configuration  [16],  which  consists 
of  four  downward-looking  acoustic  transducers  each  oriented  at  30°  from  the  vertical  [130] 
(see  inset  of  Fig.  1-16).  In  this  configuration,  each  transducer  measures  the  sensor’s  velocity 
with  respect  to  the  seafloor  as  projected  onto  the  centerline  of  its  acoustic  beam  axis, 
resulting  in  four  measurements  of  beam-component  velocity: 

Vbeam(0=  [»%,(*).  »h(0.  «k«(0]T 

Here,  each  Vb,{t)  represents  a  scalar  measurement  of  sensor  velocity  projected  along  the  ith 
beam  axis  (i.e.,  •  vsensor(f)  where  e,  is  the  unit  vector  in  the  ith  beam  direction). 
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Figure  1-16  A  RD  Instruments  1200  kHz  Workhorse  Navigator  DVL  is  shown  in-situ  on  the  bottom 
hull  of  the  SeaBED  AUV  (outer  hydrodynamic  shell  removed). 


DVL  Navigation  and  its  Characteristics 


The  beam  component  velocity  measurements  can  be  mapped  to  a  standard  Cartesian  fixed 
instrument  frame  by  the  static  4x4  instrument  transformation  matrix  M  parameterized  by 
the  transducer  geometry  [131]: 


v  sensor  (0  — 


vs,(t) 

e(0  J 


—  Mvbcam(f). 


The  XYZ  components  of  v^^  correspond  to  the  Cartesian  components  of  the  bottom 
referenced  velocity  vector  as  expressed  in  the  instrument  reference  frame,  while  e(t)  is  a 
normalized  least-squares  measure  of  velocity  error.  Discarding  the  error  term  e(f),  the 
resulting  3-vector  of  instrument  frame  velocities,  v^ensor(t),  can  be  rotated  into  a  locally- 
level  coordinate  frame  aligned  with  the  world  frame: 


vworld(0  5  RVsensor(0» 


via  the  3x3  rotation  matrix  “’R,  which  is  computed  using  measurements  from  onboard 
roll,  pitch,  and  heading  sensors.  These  navigation  frame  velocities  can  then  be  integrated 
to  obtain  a  dead- reckoned  bottom  track  DVL  position  [170]: 

Xv*xld(0  =  x(t0)  +  [  v™odd(r)dr- 
Jlo 
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While  the  dead-reckoning  integration  can  be  performed  internal  to  the  DVL  using  its  onboard 
tilt  and  magnetic  flux  gate  compass  for  orientation,  it  typically  is  computed  in  conjunction 
with  the  vehicle’s  orientation  sensors  for  better  precision.  For  these  setups,  the  error  depen¬ 
dence  in  the  integrated  vehicle  position  can  be  less  than  1%  of  total  distance  traveled  [16] . 
In  [170),  Whitcomb  et  al.  define  the  dominant  error  source  in  DVL  navigation  to  be  heading 
sensor  inaccuracy.  They  show  that  when  augmented  with  absolute  position  12  kHz  LBL 
(which  has  error  bounds  on  the  order  of  0.1-10  m)  via  complementary  linear  filtering  — 
bounded  error  estimates  approaching  centimeter  accuracy  can  be  achieved  [169]. 

1.3.3  Visually-Based  Methods 

Underwater  vehicles  are  commonly  outfitted  with  vision  sensors  for  biological  [133, 145], 
geological  [66,153,174,175],  and  archaeological  [5,144,178]  survey  needs.  As  such,  they  have 
become  standard  equipment  onboard  submersibles.  As  a  readily  available  sensor,  vision  can 
be  incorporated  into  a  navigation  framework  to  provide  alternative  vehicle  motion  estimates 
when  working  near  the  seafloor  in  relatively  clear  water.  Over  the  past  decade,  a  number  of 
techniques  have  been  proposed  within  the  context  of  real-time  underwater  navigation  and 
station-keeping  with  one  of  the  earliest  attempts  being  the  preliminary  work  of  Aguirre  [1]  in 
collaboration  with  IFREMER  (the  French  Institute  for  the  Research  and  Exploitation  of  the 
Sea)  conducted  during  the  late  1980’s.  This  work  involved  the  calculation  of  vehicle  motion 
from  a  250  image  underwater  video  sequence  using  integrated  frame-to-fraine  translational 
motion  estimates  to  provide  a  dead-reckoned  measure  of  vehicle  position  over  a  small  area 
(a  few  meters  in  size). 

MBARI/Stanford 

The  research  group  at  Monterey  Bay  Aquarium  Research  Institute  (MBARI)/Stanford 
(headed  by  Steve  Rock)  further  developed  the  role  of  vision  as  a  navigation  sensor  [45, 
46,84,98,99].  For  his  dissertation  research,  Fleischer  [44,46]  developed  a  real-time  visual 
navigation  method  that  exploited  spatial  image  constraints.  His  strategy  for  reducing  vi¬ 
sual  drift  was  to  explicitly  incorporate  pairwise  constraints  from  cross-over  points  (i.e., 
places  where  the  vehicle’s  trajectory  crossed  back  over  itself)  to  constrain  navigation  error. 
Fleischer’s  algorithm  was  based  upon  a  visual  map  representation  where  a  collection  of  key 
image  frames  were  stored  and  used  to  represent  places  the  vehicle  had  previously  visited 
(similar  in  concept  to  the  pose-constraint  network  formulation  of  SLAM  by  Lu-Milios  [94]). 
Upon  revisiting  one  of  these  areas,  image  registration  between  the  current  view  and  any 
of  the  past  views  was  used  to  provide  2D  translational  constraints  between  the  associated 
camera  poses.  For  an  estimation  framework,  he  proposed  the  use  of  either  an  augmented 
state  Kalman  filter  or  a  standard  Unear  least-squares  batch  formulation  (in  practice  the 
batch  formulation  was  used).  Though  Fleischer  did  not  expUcitly  relate  his  work  to  the 
SLAM  Uterature,  it  was  definitely  in  the  same  vein.  Limitations  of  the  approach,  however, 
are  the  overly  simplistic  2D  translation-only  image  registration  model  and  the  non-scalable 
estimation  framework.1 

*To  satisfy  the  assumptions  of  image-based  correlation,  the  vehicle  was  actively  controlled  to  maintain  a 
fixed  heading  (using  compass  readings)  and  flew  only  over  flat  portions  of  seafloor  at  a  constant  altitude. 
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Figure  1-17  A  mosaic- based  navigation  strategy.  The  vehicle  navigates  with  respect  to  the  mosaic 
by  registering  its  current  view  to  the  mosaic,  thereby  achieving  a  bounded  error  pose  estimate. 
Pitfalls  of  this  method  are:  1)  mosaic  construction  is  ill- posed  over  non-planar  terrain  and  2)  it  is 
an  awkward  mathematical  framework  for  fusing  other  sensor-based  navigation  measurements. 


University  of  Miami 

Simultaneously,  Negahdaripour  et  al.  [114-116]  at  the  University  of  Miami,  Florida  devel¬ 
oped  a  mosaic- based  navigation  (MBN)  strategy  for  video  imagery  using  an  optical  flow 
method  founded  upon  the  generalized  BCC  [113,  117]  (similar  to  the  BCC  constraint  of 
§1.2.1,  but  additionally  accounting  for  affine  varying  illumination).  Fig.  1-17  illustrates 
the  concept  behind  the  MBN  strategy.  Their  differential  formulation  uses  spatio-temporal 
image  gradients  to  measure  inter-frame  vehicle  motion  directly,  then  integrates  over  time 
to  provide  an  estimate  of  vehicle  position.  To  reduce  the  drift  rate  of  the  navigation  es¬ 
timate,  the  authors  offer  two  methods.  The  first  is  based  upon  trying  to  calculate  the 
biases  associated  with  their  direct  method  in  an  attempt  to  improve  the  inter-image  motion 
estimate  and  thereby  reduce  drift  —  this  strategy  is  analogous  to  using  a  better  inertial 
navigation  unit  (INU)  (i.e.,  the  associated  measurement  errors  are  reduced,  but  not  elimi¬ 
nated).  Hence,  it  does  not  avoid  the  undesired  characteristic  of  unbounded  error  growth. 
Their  second  modification  attempts  to  address  unbounded  error  growth  by  using  the  mosaic 
itself  to  help  constrain  position  drift.  Correction  of  errors  in  position  and  orientation  are 
made  each  time  the  mosaic  is  updated,  which  occurs  every  Llh  video  frame.  They  use  their 
current  position  estimate  P[k]  to  extract  the  mosaic  region  A I[k]  where  the  current  image 
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I[k  +  1]  is  hypothesized  to  map  to,  producing  the  estimated  image  I[k  +  1].  The  motion 
error  estimate  computed  from  I[k  + 1]  and  I\k  +  1]  is  then  used  as  a  feedback  to  correct  the 
current  position  P[k  +  1]  and  update  the  mosaic  [116].  With  this  method,  error  growth  is 
now  constrained  to  the  accuracy  of  the  mosaic. 

Instituto  Superior  Tecnico 

Finally,  the  work  of  Gracias  et  al.  [56-58]  at  the  Instituto  Superior  Tecnico,  Portugal 
also  approaches  the  problem  of  local  AUV  navigation  in  a  mosaic-based  manner.  In  their 
MBN  approach,  the  video  mosaic  is  first  generated  offline,  then  used  online  for  real-time 
navigation.  The  offline  mosaic  is  constructed  in  a  globally  consistent  manner  taking  into 
account  spatial  pairwise  constraints,  and  under  the  assumption  that  the  imaged  terrain 
is  approximately  planar,  their  method  uses  the  mosaic  map  for  navigation.  They  do  this 
by  decomposing  image-to-mosaic  homographies  into  camera  poses  using  the  world  plane 
decomposition  of  Faugeras  and  Lustman  [41].  While  their  approach  has  been  successfully 
demonstrated  for  small  areas  (i.e.,  a  navigable  area  covering  approximately  64  m2  with  a 
10.8  mx9.5  m  bounding  box  [58]),  it  does  not  scale  well  to  large  areas  because  it  suffers  from 
inconsistencies  associated  with  generating  a  single,  large,  internally  consistent  mosaic  map 
that  meets  the  assumption  of  an  extended  planar  scene.  In  addition,  another  drawback  of 
their  method  is  that  it  is  a  purely  vision  based  framework  awkward  for  fusing  other  sensor 
based  navigation  measurements. 

1.3.4  Limitations  of  Current  Approaches 

Acoustic  transponder  navigation  systems  offer  bounded  position  measurements,  but  at  high 
deployment  costs  both  in  the  context  of  ship  deployment  time  and  equipment  costs. 
Additionally,  such  systems  limit  vehicle  navigation  to  within  the  deployed  network,  which  is 
not  amenable  to  conducting  multiple  short-duration  exploratory  surveys  over  different  sites. 
Alternatively,  the  recent  advent  of  the  DVL  reduces  the  need  for  additional  infrastructure 
by  improving  the  dead-reckon  navigation  capability  of  near-seafloor  vehicles.  However,  the 
open-loop  nature  of  these  systems  implies  that  error  is  unbounded  as  a  function  of  distance 
traveled. 

Turning  to  more  recent  underwater  visually-based  methods,  the  dominant  approach  has 
been  the  mosaic-based  navigation  strategy.  While  current  MBN  implementations  have  been 
demonstrated  to  have  practical  application  over  small,  relatively  flat,  areas  (hundreds  of 
square  meters),  the  following  limitations  make  MBN  unsuitable  for  large-area  navigation: 

1.  Current  approaches  are  founded  solely  from  a  computer  vision  standpoint  (i.e.,  all 
motion  estimates  are  derived  from  imagery  without  incorporating  additional  motion 
information  from  available  vehicle  navigation  sensors).  While  this  is  interesting  from 
a  theoretical  basis,  the  problem  begs  to  be  formulated  within  an  estimation  framework 
that  fuses  all  available  vehicle  information  to  produce  an  optimal  solution. 

2.  Construction  of  a  single  large  mosaic  is  ill-conditioned  when  the  seafloor  deviates 
from  the  planar  assumption  (Fig.  1-18).  A  piecewise  planar  submapping  strategy 
could  possibly  be  employed  in  an  attempt  to  salvage  MBN.  However,  highly  3D 
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Figure  1-18  Failure  of  a  MBN  strategy  lies  in  the  ill-posed  construction  of  a  single  large  mosaic 
over  non-planar  terrain.  Here  we  see  bathymetry  from  a  deep-water  archaeological  site  showing  an 
amphora  pile  sitting  in  a  small  depression  and  that  construction  of  a  mosaic  over  the  same  pile  fails 
due  to  3D  relief  [149]  (the  circle  designates  the  same  area  in  both  modalities). 
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(a)  Bathymetry  map  over  a  deep-water  amphora  pile  (b)  The  mosaic  diverges  as  a  result  of  3D  terrain  ef- 
(units  are  in  meters).  fects  violating  the  planar  seafloor  assumption. 


environments  (such  as  surveys  over  coral  reefs  [145]  where  mosaicking  assumptions 
are  severely  violated)  would  remain  inapplicable. 

3.  Low  overlap  imagery  is  common  for  AUVs  since  they  are  power  limited  and  cannot 
afford  the  continuous  illumination  necessary  for  video  frame  rates  [12, 152].  Hence, 
image  registration  is  much  more  difficult  in  this  scenario  since  the  inter-image  motion 
may  be  large.  This  implies  that  direct  methods  and  video-rate  techniques  are  inap¬ 
plicable  because  they  assume  incremental  (i.e.,  high  overlap)  camera  motion  between 
frames. 

These  limitations,  in  conjunction  with  the  limitations  of  traditional  LBL  and  DVL  methods, 
have  been  the  impetus  for  the  integrated  systems-level  visual  approach  adopted  in  this 
thesis. 


1.4  Thesis  Outline 

In  this  section  we  outline  our  systems-level  approach  to  visually- based  navigation  and  the 
order  in  which  material  is  presented  in  this  thesis  document. 

1.4.1  Approach 

This  thesis  answers  the  question  of  how  to  achieve  real-time,  scalable,  bounded  error,  6- 
DOF,  precision  navigation  for  near-seafloor  AUVs  in  rugged-terrain  via  the  incorporation 
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of  camera-derived  motion  estimates.  Unlike  prevailing  MBN  methods  described  above,  this 
thesis  adopts  a  systems-level  approach  to  using  visual  navigation  underwater,  termed  “visu¬ 
ally  augmented  navigation  (VAN)”  [36].  Specifically,  VAN  casts  visual  sensor  fusion  within 
a  stochastic  view-based  SLAM  framework  amenable  to  the  peculiarities  of  low-overlap  un¬ 
derwater  imagery.  VAN  combines  pairwise  image-derived  relative-pose  spatial  constraints 
with  more  traditional  navigation  sensor  measurements  (e.g.,  attitude,  Doppler  velocities, 
depth)  to  recover  a  bounded  estimate  of  the  vehicle’s  6-DOF  trajectory.  In  this  context, 
VAN  embraces  the  stochastic  framework  of  SLAM  while  addressing  the  practical  issue  of 
“features”  and  map  representation  in  an  unstructured  underwater  environment  observed 
through  low-overlap  imagery.  Rather  than  detecting  and  tracking  particular  features  over 
time,  overlapping  image  pairs  are  registered  in  a  wide-baseline  epipolar  framework  to  re¬ 
cover  the  relative  orientation  and  baseline  direction  between  camera  poses.  These  pairwise 
image-derived  spatial  constraints  effectively  allow  virtual  observation  of  the  current  vehicle 
pose,  xv,  relative  to  any  other  pose,  x7,  with  common  scene  overlap  resulting  in  a  view- 
based  map  of  the  world  where  the  observation  model  is  of  the  form  h(xv,  Xj)  —  thus,  vehicle 
poses  effectively  become  virtual  “landmarks”  [36,87]. 

1.4.2  Document  Roadmap 

The  material  presentation  is  broken  into  the  following  chapters. 

•  Chapter  2  lays  the  foundation  of  an  estimation  framework  suitable  for  fusing  low- 
overlap  pairwise  image-derived  constraints.  Additionally,  we  present  a  systems-level 
image  registration  strategy  that  exploits  the  available  information  from  a  propriocep¬ 
tive/exteroceptive  pose-instrumented  robotic  platform. 

•  Chapter  3  discusses  the  “information  formulation”  of  the  feature-based  SLAM  pos¬ 
terior.  The  recent  popularity  of  this  representation  stems  from  its  “almost  sparse” 
structure.  In  particular,  a  number  of  recent  large-area  SLAM  algorithms  have  been 
derived  by  enforcing  sparsity  in  this  representation.  We  explore  the  consequences  of 
this  approximation  and  provide  new  insight  as  to  why,  and  how,  this  can  lead  to  filter 
inconsistency. 

•  Chapter  4  presents  the  novel  insight  that  the  information  formulation  can  be  made 
exactly  sparse  without  any  approximation  by  using  a  view-based  SLAM  representa¬ 
tion.  This  allows  us  to  scale  the  estimation  framework  of  Chapter  2  to  very  large 
environments  achieving  O(n)  complexity  by  exploiting  the  sparse  representation.  Ad¬ 
ditionally,  we  present  a  novel  algorithm  for  data  association  —  something  that  had 
previously  been  an  open  research  issue  in  the  information  form. 

•  Chapter  5  provides  a  summary  of  contributions,  algorithm  failure  modes,  and  sug¬ 
gestions  for  future  work. 

•  Appendix  A  describes  in  detail  our  robot  platform  and  the  various  analytical  models 
used  to  describe  it. 

•  Appendix  B  provides  accompanying  derivations  for  the  work  presented  in  Chapter  3. 
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CHAPTER  2 


Visually  Augmented  Navigation 


f  \  S  autonomous  underwater  vehicles  are  used  more  routinely  in  an  exploratory  context 
"  %  for  ocean  science,  the  goal  of  visually  augmented  navigation  is  to  improve  the  near- 
JL.  seafloor  navigation  precision  of  such  vehicles  without  imposing  the  burden  of  having 
to  deploy  additional  infrastructure.  This  is  in  contrast  to  traditional  acoustic  long  baseline 
navigation  techniques,  which  require  the  deployment,  calibration,  and  eventual  recovery  of 
a  transponder  network.  To  achieve  this  goal,  VAN  is  framed  within  a  vision-based  simul¬ 
taneous  localization  and  mapping  framework  that  exploits  the  systems-level  complemen¬ 
tary  aspects  of  a  camera  and  strap-down  sensor  suite  to  create  a  bounded-error  navigation 
technique  that  is  robust  to  the  peculiarities  of  low-overlap  underwater  imagery.  It  uses  a 
view-based  representation  where  camera-derived  relative-pose  measurements  provide  spatial 
constraints,  which  enforce  trajectory  consistency  and  also  serve  as  a  mechanism  for  loop- 
closure  (Fig.  2-1).  This  chapter  outlines  the  multi-sensor  VAN  framework  and  demonstrates 
it  to  have  compelling  advantages  over  a  purely  vision-only  based  approach  by  1)  improving 
the  robustness  of  low-overlap  underwater  image  registration,  2)  setting  the  free  gauge  scale, 
and  3)  allowing  for  a  disconnected  camera  constraint  topology. 

2.1  Introduction 

From  exploring  abandoned  mines  in  Pennsylvania  [159],  to  exploring  other  planets  in  our 
solar-system  [26],  robotic  exploration  of  remote  environments  extends  our  reach  to  areas 
where  human  exploration  is  considered  too  dangerous,  too  technically  challenging,  or  both. 
While  high  profile  missions  like  the  2003  Mars  rovers  epitomize  the  lengths  that  we  will 
go  to  in  search  of  new  origins  of  life,  it  cannot  be  overstated  that  exploring  the  deep- 
abyss  of  our  own  oceans  can  be  nearly  as  alien  and  offer  just  as  startling  discoveries  about 
how  life  began.  Though  manned  vehicles  like  Alvin  [2, 31]  have  been  responsible  for  many 
of  the  most  important  deep-science  discoveries  [4,25],  the  extreme  design  requirements, 
operational  costs,  risk  of  life,  and  limited  availability  prevent  its  ubiquitous  use.  Therefore, 
out  of  necessity  the  deep-sea  has  become  an  arena  where  the  presence  of  mobile  robotics  is 
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Figure  2-1  The  objective  of  VAN  is  the  real-time  fusion  of  “zero-drift”  camera  measurements  with 
navigation  sensor  data  to  close- the- loop  on  dead-reckoned  error.  For  this  purpose  VAN  adopts  a  top- 
down  systems-level  approach  to  visual  navigation.  It  uses  a  view-based  representation  founded  upon 
registering  raw  imagery  to  generate  pairwise  camera  constraints  that  are  then  fused  with  navigation 
sensor  data  in  a  delayed-state  extended  Kalman  filter  framework. 


VAN  Estimate 


Navigation  Sensors 


pervasive  and  their  scientific  utility  revolutionary  [6, 141, 145, 176]. 

While  underwater  mobile  robotics  have  made  significant  inroads  into  mainstream  science 
over  the  past  two  decades,  a  limiting  technological  issue  to  their  widespread  utility,  especially 
for  exploration,  is  the  lack  of  easily  obtainable  precision  navigation.  With  the  advent  of  GPS 
many  surface  and  air  vehicle  applications  are  able  to  easily  obtain  their  position  anywhere  on 
the  globe  with  precision  of  a  few  meters  via  the  triangulation  of  satellite  transmitted  radio 
signals  —  unfortunately,  these  radio  signals  do  not  penetrate  sub-sea  [172]  (nor  underground 
[159],  or  even  indoors  [19]).  Therefore,  as  discussed  in  Chapter  1,  traditional  underwater 
navigation  strategies  have  been  to  deploy  an  acoustic  version  of  GPS  using  seafloor  tethered 
beacons  to  relay  time-of-flight  range  measurements  for  triangulated  positioning.  However, 
the  cost,  complexity,  and  limitations  of  an  infrastructure  dependent  solution  leave  much  to 
be  desired,  which  is  further  complicated  by  the  fact  that  alternative  strap-down  solutions 
suffer  from  a  position  drift  that  grows  unbounded  with  time. 

Over  the  past  decade,  a  big  research  push  within  the  terrestrial  mobile  robotics  commu¬ 
nity  has  been  to  develop  environmentally- based  navigation  algorithms,  which  eliminate  the 
need  for  additional  infrastructure  and  bound  position  error  growth  to  the  size  of  the  envi¬ 
ronment  —  a  key  prerequisite  for  truly  autonomous  navigation.  The  basis  of  this  work  has 
been  to  exploit  the  perceptual  sensing  capabilities  of  robots  to  ‘‘beat-down”  accumulated 
odometric  error  by  localizing  the  robot  with  respect  to  ‘‘landmarks”  in  the  environment. 
The  question  of  how  to  use  such  a  methodology  for  navigation  and  mapping  was  first  the¬ 
oretically  addressed  in  a  probabilistic  framework  in  the  mid  1980’s  with  seminal  papers  by 
Smith,  Self,  and  Cheeseman  [154]  and  Moutarlier  and  Chatila  [110],  which  have  since  then 
become  the  cornerstone  of  the  research  field  known  as  SLAM. 

One  of  the  major  challenges  of  a  SLAM  methodology  is  that  defining  what  constitutes 
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a  feature  from  raw  sensor  data  can  be  nontrivial.  In  man-made  structured  environments, 
typically  composed  of  planes,  lines  and  corners  primitives,  features  can  be  more  easily 
defined  [158].  However,  unstructured  outdoor  environments  can  pose  a  more  challenging 
task  for  feature  extraction  and  matching,  which  has  lead  to  “scan-matching”  [94]  based 
approaches  that  do  not  require  an  explicit  representation  of  features.  These  view-based 
techniques  have  traditionally  been  used  with  accurate  perceptual  sensors  such  as  laser  range 
finders  where  raw  data  can  be  “matched”  directly  in  an  iterative  closest  point  sense.  Along 
these  lines,  our  underwater  approach  is  to  use  a  camera  as  an  accurate  and  inexpensive 
perceptual  sensor  to  collect  near-seafloor  imagery  that  can  also  be  “matched”  directly. 
Motivation  for  such  an  approach  comes  from  the  fact  that  typical  AUV  imagery  has  low 
temporal  overlap  (to  minimize  illumination  power  consumption  [12]),  which  implies  that 
3D  features  in  the  environment  are  not  observed  for  more  than  a  few  views.  Such  a  low- 
overlap  constraint  implies  that  a  view-based  representation  is  particularly  suitable  for  this 
type  of  data  since  overlapping  image  pairs  can  be  registered  directly  in  a  pairwise  fashion  to 
extract  “zero-drift”  relative-pose  modulo  scale  measurements  without  explicitly  representing 
3D  feature  points.  In  this  way,  registering  an  image  taken  from  time  U  to  an  image  taken  at 
time  tj  provides  a  spatial  constraint  whose  error  is  bounded  regardless  of  time  or  distance 
traveled  between  the  two  views. 

In  the  rest  of  this  chapter  we  present  our  framework  and  methodology  for  incorporating 
camera-derived  relative-pose  measurements  with  vehicle  navigation  data  in  a  view-based 
SLAM  context.  In  particular,  §2.2  describes  our  assumptions  while  §2.3  presents  a  delayed- 
state  extended  Kalman  filter  (EKF)  framework  for  fusing  camera  measurements  that  also 
serves  as  a  foundation  for  probabilistic  link  hypothesis.  In  §2.4  we  then  explain  how  we 
actually  make  the  pairwise  camera  measurements  using  a  systems-level  feature-based  image 
registration  approach.  We  show  that  a  multi-sensor  approach  has  compelling  advantages 
over  a  camera-only  based  navigation  system  and  in  particular  that  it  improves  registration 
robustness  via  a  pose-constrained  correspondence  search.  Results  are  presented  in  the 
context  of  a  real-world  dataset  collected  by  an  AUV  in  a  rugged  undersea  environment,  and 
for  tank  data  collected  by  a  ROV  for  which  ground-truth  was  available. 

2.2  Assumptions 

Our  application  is  based  upon  using  a  pose  instrumented  AUV  equipped  with  a  single 
down-looking  calibrated  camera  to  perform  underwater  imaging  and  mapping  [146, 147]. 
The  vehicle  makes  acoustic  measurements  of  both  velocity  and  altitude  relative  to  the 
seafloor.  Absolute  orientation  is  measured  to  within  a  few  degrees  over  the  entire  survey 
area  via  inclinometers  and  a  flux-gate  magnetic  compass.  Bounded  positional  estimates  of 
depth,  Z,  are  provided  by  a  pressure  sensor.  A  detailed  platform  discussion  can  be  found 
in  §A.l,  however,  for  convenience  Table  2.1  provides  a  short  summary  of  assumed  sensor 
characteristics.  In  brief  we  assume: 

•  An  ideal,  calibrated  camera. 

•  An  instrumented  platform. 

•  Known  reference  frames  (i.e.,  vehicle  to  camera,  vehicle  to  sensor). 

•  Pairwise  registration  to  accommodate  low-temporal  overlap. 
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Table  2.1  Typical  pose  sensor  characteristics  for  underwater  platforms. 


Measurement 

Sensor 

Precision 

Roll/Pitch 

Tilt  Sensor 

±0.5° 

Heading 

Magnetic  Flux  Gate  Compass 

±2.0° 

3- Axis  Angular  Rate 

AHRS 

±1.0°/s 

Body  Frame  Velocities 

Acoustic  Doppler 

±1-2  mm/s 

Depth 

Pressure  Sensor 

±0.01% 

Altitude 

Acoustic  Altimeter 

±0.1  m 

2.3  View-Based  SLAM  Estimation  Framework 


Typical  structure-from- motion  approaches  [8,27,104,163,181]  estimate  both  camera  motion 
and  3D  scene  structure  from  a  sequence  of  video  frames.  However,  in  our  application  the 
low  degree  of  temporal  image  overlap  (typically  on  the  order  of  35%  or  less)  motivates  us 
to  focus  on  recovering  pairwise  measurements  from  spatially  neighboring  image  frames.  In 
this  framework,  the  camera  provides  measurements  of  the  6-DOF  relative  coordinate  trans¬ 
formation  between  poses  modulo  scale.  These  measurements  are  used  as  constraints  in  a 
recursive  estimation  framework  that  tries  to  determine  the  global  poses  consistent  with  the 
camera  measurements  and  navigation  prior  as  shown  in  Fig.  2-2.  These  global  poses  corre¬ 
spond  to  samples  from  the  robot’s  trajectory  at  the  times  associated  with  image  acquisition. 
Thus,  unlike  the  typical  feature-based  SLAM  estimation  problem,  which  keeps  track  of  the 
current  robot  pose  and  an  associated  landmark  map,  the  VAN  state  vector  consists  entirely 
of  delayed  vehicle  states  corresponding  to  the  vehicle  poses  at  the  times  the  images  were 
captured.  This  delayed-state  approach  corresponds  to  a  view-based  representation  of  the 
environment,  which  can  be  traced  back  to  a  batch  scan-matching  method  by  Lu  and  Mil- 
ios  [94]  using  laser  data,  a  delayed  decision  making  framework  by  Leonard  and  Rikoski  [87] 
for  feature  initialization  with  sonar  data,  and  the  hybrid  batch/ recursive  formulations  by 
Fleischer  [44]  and  McLauchlan  [105]  using  camera  images.  In  this  context,  scan-matching 
raw  images  results  in  virtual  observations  of  robot  motion  with  respect  to  a  place  it  has 
previously  visited. 


Figure  2-2  A  view-based  representation  consists  of  a  network  of  navigation  and  camera  constraints 
over  a  collection  of  time-delayed  vehicle  poses  associated  with  the  images  in  our  view-based  map. 

camera 


camera 


camera 
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2.3.1  Delayed-State  Extended  Kalman  Filter 

We  begin  by  describing  our  representation  of  vehicle  state  and  a  general  system  model  for 
state  evolution  and  observation.1  This  model  is  used  as  the  basis  for  trajectory  estimation 
within  the  context  of  an  EKF  [54].  We  then  show  how  to  incorporate  camera-derived 
relative- pose  measurements  by  augmenting  our  state  representation  to  include  a  history  of 
delayed-state  vehicle  poses. 


Fixed-Size  State  Description 

The  vehicle  state  vector,  xv,  contains  both  pose  and  kinematic  terms,  xp  and  x*  respectively, 
and  is  defined  as 

x„=[xj,  xJ]T 

Here  xp  is  a  6- vector  of  vehicle  pose  in  the  local-level  navigation  frame  where  XYZ  roll  pitch 
heading  Euler  angles  are  used  to  represent  orientation  [48]  (i.e.,  xp  =  X£V  =  [x,  y,  z ,  0, 0,  xp]  ), 
and  x*  represents  any  kinematic  state  elements  that  are  required  for  propagation  of  the 
vehicle  process  model  (e.g.,  body- frame  velocities,  accelerations,  angular  rates).  In  ad¬ 
dition  we  assume  that  the  vehicle  state  can  be  modeled  as  being  normally  distributed, 
xv  ~  Af(fiv,  £vv),  with  mean  and  covariance  given  by 


Hv  =  and  = 


^ pp  SpK 
Sfcp 


The  vehicle  state  evolves  through  a  time-varying  continuous  time  process  model,  f(  • ,  t), 
driven  by  white  noise,  w(t)  ~  A/"(0,  Q(<)),  and  deterministic  control  inputs,  u(t),  while 
discrete  time  measurements  of  elements  in  the  vehicle  state  are  observed  through  an  obser¬ 
vation  model,  h( -,£*),  corrupted  by  time  independent  Gaussian  noise,  v[t*.]  ~  N(0,  R*) , 
with  £[wvT]  =  0.  The  resulting  system  model  is: 


xv(t)  =  f(xv(t),u(t),t)  +  w(t) 
z  [tk]  =  h(x„[tfc],<*)  +  v[tfc]. 

As  is  typical  in  the  navigation  literature,  the  vehicle  state  distribution  is  approximately 
maintained  using  a  continuous-discrete  EKF  [54]  given  by 


Prediction 


Svv(t)  =  FxEvt,(t)  +  T,vv(t)Fx  +  Q(t) 


(2.2) 


Update 


K  =  £wHj(Hx£wHj  +  Rib)-1 
=  K  +  K(z[tfc]  -  h (£„,  <fc))  (2.3) 

=  (I  -  KHX)EW(I  -  KHx)T  +  KRfcKT 


where  Fx  = 


and  H 


.  ah  | 
x  5xUIa„ 


are  the  process  and  observation  model  Jacobians  re- 


1  See  Appendix  A  for  details. 
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spectively.  In  this  formulation,  the  predicted  vehicle  distribution,  xv  ~  N(fiv,  Evv),  is 
computed  between  asynchronous  sensor  measurements  by  solving  (2.2)  via  a  fourth-order 
Runge-Kutta  numerical  integration  approach  [119]. 

Unfortunately,  the  fixed-size  state  description,  x„,  does  not  allow  us  to  represent  our 
pairwise  camera  constraints.  This  is  because  registration  of  an  image  pair  results  in  a 
relative-pose  measurement  modulo  scale,  and  not  an  absolute  observation  of  elements  in 
vehicle  pose,  xp.  Therefore,  before  we  can  incorporate  pairwise  camera  constraints,  we 
have  to  first  augment  our  state  representation  to  include  a  history  of  vehicle  poses  where 
each  delayed-state  entry  corresponds  to  an  image  in  our  view-based  map.  Under  this  rep¬ 
resentation,  the  distribution  we  are  trying  to  estimate  is  p(£t  |z*,  u*)  where  zl  represents  all 
measurements  up  to  time  t  (including  camera  and  navigation  sensors),  u*  is  the  set  of  all 
control  inputs,  and  £t  is  our  view-based  SLAM  state  vector  (note  that  initially  £t  =  xv). 
Next,  we  describe  the  process  of  how  delayed-states  are  added  to  our  “map”. 


Augmenting  our  State  Description  with  Delayed-States 


At  time  t\  corresponding  to  when  the  first  image  frame,  7i,  is  captured,  we  augment  our 
state  description,  £t,  to  include  the  vehicle’s  pose  of  where  it  was  when  it  took  that  image 
(i.e.,  =  [xj,  xJJT).  Therefore,  at  this  time  instance  the  augmented  state  distribution, 

~  £«),  is  given  by 


=  Wi]T.  /y*i]T]T  =  W.  ^J.]T 


^ vv  [^l] 

£vpl 

.Oil 

Eppitij. 

P\V 

^P1P1_ 

This 

map 


process  is  repeated  for  each  camera  frame  that  we  wish  to  keep  in  our  view-based 
so  that  after  augmenting  n  delayed  states  (one  for  each  camera  frame)  we  have  £t  = 

•••  >  xJ„]T  with 


^ vv 

Evpi 

*  *  ^ Vpn 

Mt  = 

and  E  t  — 

Spiv 

Spipi 

^PlPn 

Mpn. 

^ PnV 

^PnPl 

^PnPn  _ 

Note  that  in  (2.4)  the  vehicle’s  current  pose  xp  is  fully  correlated  with  xPl  by  definition. 
Therefore,  when  the  nth  delayed-state,  xPn,  is  augmented  in  (2.5),  its  cross-correlation  with 
the  other  delayed  poses  in  E*  is  non- zero  since  the  current  vehicle  state  has  correlation  with 
each  delayed-state. 

The  system  model  (2.1)  must  be  also  be  extended  to  incorporate  the  augmented  state 
representation.  For  the  process  model  the  only  required  change  is  that  xv  continues  to 
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evolve  through  the  vehicle  dynamic  model,  f(  •  ,  £),  while  the  delayed-state  entries  do  not: 


x„' 

'f(x„(t),u(t),t)  +  w(t)' 

d 

Xpi 

06x1 

dt 

_XPn. 

- 1 

...  X 
<o 
o 

_ 1 

Similarly,  navigation  sensor  observation  models  continue  to  remain  a  function  of  only  the 
current  vehicle  state,  xw,  which  results  in  sparse  Jacobians  of  the  form 

H^  =  [Hx>  0mx6>  "  *  j  0mx6] 


where  m  is  the  dimension  of  the  measurement.  However,  in  the  case  of  camera-derived 
measurements  the  observation  model  becomes  a  function  of  delayed-states  entries  as  is 
discussed  next. 


2.3.2  Pairwise  Camera  Observation  Model 

Pairwise  image  registration  from  a  calibrated  camera  has  the  ability  to  provide  a  measure¬ 
ment  of  relative-pose  modulo  scale  between  delayed-state  elements  xPi  and  xPj  ,  provided 
images  I{  and  Ij  have  overlap.  In  deriving  the  camera  observation  model  we  use  the  familiar 
Smith,  Self,  and  Cheeseman  notation  [154]  described  in  §A.2,  and  assume  that  the  camera 
to  vehicle  static  pose  xvc  is  known. 


Camera  Relative  Pose 


The  delayed-state  entries  xPi  and  xPj  correspond  to  vehicle  poses  X£Vi  and  X£Vj  as  represented 
in  the  local-level  navigation  frame  respectively.  Hence,  using  the  static  camera  to  vehicle 
pose,  xvc,  we  can  express  the  transformation  from  camera  frame  i  to  j  using  the  tail-to-tail 
operation  as 


xCjct  =  ©x*c .  ©  *eCi 

=  e(x£Vj.  ©  Xvc  )  ©  (xiVi  ©  Xvc) 


with  Jacobian 


cbc, 


CJCl 


dxcjci  d(x*Cj,  XfcJ 


Jcjct  d(xevj ,  X(Vt )  d(xeCj,xeci)  d(x(Vj,  x(Vt) 


chain-rule 


—  eJ© 


(x<vx<c<) 


06x6 

°6x6  J©l|(x<t,.,x„c) 


(xtVj,xvc) 


(2.6a) 

(2.6b) 


(2.7a) 


(2.7b) 


5-DOF  Camera  Measurement 

However,  note  that  what  the  camera  actually  measures  is  not  the  6-DOF  relative  pose 
measurement  of  (2.6),  but  rather  only  a  5-DOF  measurement  due  to  loss  of  scale  in  the 
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image  formation  process.  This  loss  of  scale  implies  that  only  the  baseline  direction,  as 
represented  by  azimuth  and  elevation  angles  ctji  and  fiji  respectively,  is  recoverable  from 
image  space.  Realizing  that  the  relative-pose  measurement  xCjCi  is  parameterized  by 

XcjCi  =  l  3  tcjd  »  ®CjcJ  =  [xji »  Vji »  zji »  4*3%  j  @ji  >  iftji]  > 
we  can  express  the  bearing-only  baseline  measurement  of  aji  and  (3ji  as 


ocji  —  atan2(t/ji ,  Xji) 


with  Jacobian2 


 d{ckji}  Pji) 


-Vji 

(xji+y%) 

-ZjiXji 


pji  =  atan2 (z-ju  (x]{  +  y£)s) 


TZ~+yf~) 


3x»+rfi)1/2i*'ji+Vji+zji)  (X]i+Vji ) 1  /2^'xi +y],+zji ) 
Hence,  the  pairwise  5-DOF  camera  observation  model  becomes 


0 

(*h+V?i)1/2 


zji  —  hji(£t)  —  h Jfi(xpj  ,  Xp.  )  —  [ckjii,  /Jjit,  Oji,  Ipji] 


with  Jacobian 

H<  = 

o-  -o- 

ahj, 

15x7 

...ol 

where 

ahji 

dxCjCi 

02x3 

d(xP],xPt)  dxCjCi  d(xPj ,  xPt) 

.03x3 

13x3 

Fig.  2-3  illustrates  this  pairwise  camera  constraint. 


(2.8) 


What  do  pairwise  camera  measurements  tell  us? 


Now  that  we  have  derived  how  to  model  pairwise  camera  measurements,  it’s  worth  intu¬ 
itively  describing  what  equation  (2.8)  means  in  terms  of  reducing  navigation  error.  First 
of  all,  pairwise  camera  measurements  provide  us  with  a  bearing-only  measurement  of  the 
baseline  between  poses  —  hence,  we  are  dependent  upon  our  navigation  sensors  to  set  the 
free-gauge  scale.  In  our  application  this  scale  is  implicitly  fixed  within  the  EKF  by  bounded 


2Note  that  atan2(j/,x)  =  < 


f  tan-1  ( y/x )  x  >  0,  y  >  0 

tan"1  (y/x)  +  it  x<0,  y>0 


tan-1  (y/x)  x  >  0,  y  <  0 


=  tan'1  (y/x)  4-  constant. 


, tan'1  (y/x)  -  tt  x  <  0,y  <  0 

This  implies  that  =  9  ,  so  that  in  general  if  /  =  f(x)  and  g  =  g(x ),  then 


d atan2(g,  /)  dtan-1(^)  1  1  r-2#\  1  /  fdg  _  <}S_\ 

dx  dx  1  +  (f)2  ^dx _ ^ _ dx  \  /2  4- y2  dx  ^  dx 

product-rule 
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Figure  2-3  The  pairwise  5-DOF  camera  measurement  (i.e.,  relative-pose  modulo  scale). 


measurements  of  depth  z  coming  from  a  pressure  sensor  coupled  with  Doppler  velocities 
that  provide  us  with  an  integrated  measurement  of  XYZ  position.  Secondly,  (2.8)  tells  us 
that  camera  measurements  can  only  reduce  relative  positional  error  components  that  are 
orthogonal  to  the  baseline  motion.  Referring  to  Fig.  2-3  we  see  that  frame  Oi  can  slide 
anywhere  along  the  baseline,  Hji,  without  effecting  our  measure  of  azimuth/elevation.  This 
suggests  that  temporal  camera  measurements  do  very  little  to  reduce  along-track  error 
growth.  Hence,  long  linear  surveys  will  benefit  far  less  from  camera  constraints  than  cross¬ 
over  survey  patterns,  which  have  “loops”  in  their  trajectory  and  result  in  ample  spatial 
constraints.  Finally,  the  nonlinear  bearing-only  constraints  of  (2.8)  imply  that  linearization 
errors  will  be  less  significant  in  the  EKF  if  we  can  maintain  good  map  contact  (e.g.,  typi¬ 
cal  grid-based  surveys  achieve  this)  to  prevent  our  linearization  point  from  “drifting”  too 
far  from  the  truth.  This  also  suggests  that  when  closing  large  loops,  where  the  lineariza¬ 
tion  point  may  be  far  from  the  true  state,  that  we  should  incorporate  the  pairwise  camera 
constraints  in  aggregate  via  some  form  of  triangulation  —  a  technique  commonly  used  for 
feature-initialization  in  bearing-only  SLAM  applications  [88]. 


2.3.3  Link  Hypothesis 

An  essential  task  in  a  view-based  representation  is  the  hypothesis  of  probable  overlapping 
image  pairs.  Since  the  image  registration  process  is  arguably  the  “slowest”  component  in  the 
VAN  framework,  it  is  to  our  advantage  to  feed  the  registration  module  only  likely  candidate 
pairs  so  as  to  not  waste  time  attempting  registration  on  images  that  have  a  low  probability  of 
overlap.  Our  link  hypothesis  strategy  is  based  upon  a  grossly-simplified  ID  model  for  image 
overlap  that  uses  our  state  estimate  and  measured  scene  altitude  (beam-range  measurements 
from  the  DVL)  to  project  image  footprints  onto  a  horizontal  plane  as  shown  in  Fig.  2-4. 
Since  our  imaging  AUV  flies  in  a  closed-loop  bottom-following  mode  for  camera  surveys,  it 
approximately  maintains  a  fixed  altitude  off  of  the  seafloor.  Therefore,  for  simplicity,  we 
compute  pairwise  overlap  using  the  larger  altitude  of  a  camera  pair  (Fig.  2-4(b)). 

Assuming  the  above  mentioned  configuration,  image  percent  overlap,  e,  can  be  defined 
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Figure  2-4  Calculation  of  pairwise  overlap  for  link  hypothesis.  To  simplify  the  calculation  of  image 
overlap,  we  reduce  it  to  a  ID  case  on  a  horizontal  plane  (a).  In  the  illustrations  below  O*  and  Oj 
are  the  camera  centers,  FOV  is  the  field  of  view,  A*  and  Aj  are  the  altimeter  measured  altitudes, 
and  Wj  are  the  computed  ID  image  widths,  and  dij  is  the  Euclidean  baseline  distance  coming 
from  our  state  estimate.  Our  vehicle  approximately  maintains  a  fixed  altitude  over  the  seafloor  (this 
implies  A{  ~  A j),  therefore,  we  simplify  the  calculation  further  by  assuming  the  larger  altitude  for 
both  cameras  as  shown  in  (b). 


yyyyyfyfyfyfyyyy. 

Wi  =  2 A,  tan(iFOV) 


Wj  =  2Aj  tan (i  FOV) 

0>) 


(b) 


as 

f  1  —  0  ^  d  <  Vi ^max 

£  —  /  w  max 

)  0  otherwise 

Here,  d  is  the  Euclidean  distance  between  the  camera  centers,  WVnax  =  2Amax  tan(^FOV) 
is  the  ID  image  width,  Amax  is  the  larger  altitude  of  the  pair,  and  FOV  is  the  camera  field 
of  view.  Under  this  scheme,  we  can  set  thresholds  for  minimum  and  maximum  percent 
image  overlap  to  obtain  constraints  on  camera  distance.  We  can  then  compute  a  first- 
order  probability  associated  with  whether  or  not  the  distance  between  the  camera  pair 
falls  within  these  constraints.  This  calculation  serves  as  the  basis  of  our  automatic  link 
hypothesis  algorithm,  outlined  in  Algorithm  2.1,  where  all  frames  in  our  view-based  map 
are  checked  to  see  whether  or  not  they  could  overlap  with  the  current  robot  view  (i.e.,  linear 
complexity  in  the  number  of  views).  The  k  most  likely  candidates  (k  =  5  in  our  application) 
are  then  sent  to  our  image  registration  module  for  comparison.  While  somewhat  simplistic, 
we  have  obtained  good  results  with  this  approximation  and  it  has  been  the  basis  for  the 
work  presented  in  this  thesis  using  automatically  proposed  links. 


2.4  Generating  the  5-DOF  Camera  Measurement 

Having  presented  a  view-based  estimation  framework  capable  of  incorporating  5-DOF  relative- 
pose  measurements,  we  now  turn  our  attention  to  explaining  how  we  actually  make  the 
pairwise  camera  measurement.  At  the  core  is  a  feature-based  image  registration  engine 
whose  purpose  is  to  generate  pairwise  measurements  of  relative- pose.  Essential  to  this  goal 
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Algorithm  2.1  View- based  link  hypothesis.  Hypothesize  which  images  /*  in  our  view-based  map 
have  a  high  probability  of  overlapping  with  the  current  robot  view  Ir. 


1: 

2: 

3: 

4: 

5: 

6: 

7: 

8: 

9: 

10: 

11: 

12: 

13: 

14: 

15: 

16: 

17: 

18: 


define:  k  {maximum  number  of  candidates  to  return} 

define:  emin  E  [0, 1]  {minimum  percent  overlap} 
define:  emax  6  [0, 1]  {maximum  percent  overlap} 
define:  a  6  [0, 1]  {confidence} 
for  all  Ii  do 
Amax  «“  ma x(Ai,Ar) 

W max  4  2  A  max  tan(iFOV) 

dmin  *  (1  t-max)  '  Wmax 

dmax  4  (1  fmin)  '  Wmal 


extract  from  our  state  the  joint-marginal 


'■Pi 

CPr 


■N{ 


Pi 


l/Vj 


jPiPi 

JPrPi 


-'PiPr 

JPrPrJ 


) 


compute  the  relative  camera  pose  and  its  first-order  statistics  (2. 6), (2. 7) 
using  xCrC>  compute  the  Euclidean  distance  dt  and  its  first-order  statistics: 

di  Where  di  ll^CrCill 

compute  the  probability  Pi  that  4in  <  di  <  dmax: 

Pi  XfcT  (r;  ^  ) dr 

if  Pi  >  a  then 

add  /j  to  the  candidate  set  S 

end  if 
end  for 

sort  candidate  set  S  by  Pi  and  return  up  to  the  k  most  probable  candidates 


is  the  capability  to  cope  with  wide-baseline  registration  for  two  main  reasons. 

1.  Low  overlap  imagery  is  common  in  our  temporal  image  sequences  due  to  the  nature 
of  our  underwater  application.  Therefore,  we  must  be  able  to  deal  with  images  in  the 
temporal  sequence  having  35%  or  less  overlap. 

2.  Loop-closing  and  cross-track  spatial  image  constraints  are  the  greatest  strength  of  a 
VAN  methodology.  It  is  these  measurements  which  help  to  correct  dead-reckoned  drift 
error  and  enforce  recovery  of  a  consistent  trajectory.  Since  wide-baseline  viewpoints 
are  typical  in  this  scenario,  this  condition  would  arise  even  if  temporal  overlap  were 
much  higher  as  with  video-frame  rates. 

Thus,  in  order  to  be  able  to  successfully  handle  wide-baseline  image  registration,  our  ap¬ 
proach  has  been  to  extend  a  typical  state-of-the-art  feature-based  image  registration  frame¬ 
work  (§2.4.1)  to  judiciously  exploit  our  navigation  sensor  capabilities  wherever  possible.  For 
example,  in  §2.4.2  we  show  how  we  can  exploit  absolute  orientation  sensor  measurements  to 
reduce  viewpoint  variability  in  our  feature  encoding,  and  also  obtain  a  good  initialization  for 
pairwise  maximum  likelihood  refinement.  We  also  show  in  §2.4.3  how  we  can  use  our  pose 
prior  and  altitude  measurements  to  improve  the  robustness  of  correspondence  establishment 
via  a  novel  pose-constrained  correspondence  search. 
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Figure  2-5  Illustration  of  a  pinhole  camera  model.  An  intrinsically  calibrated  camera  implies  that 
the  mapping  from  Euclidean  camera  coordinates  to  image  pixel  coordinates  is  known.  The  pinhole 
projective  mapping  from  scene  point  M  to  image  point  m  is  described  in  homogeneous  coordinates 
in  terms  of  a  3  x  4  projection  matrix  P  =  K[R|  t]  where  K  is  the  3x3  upper  triangular  intrinsic 
parameter  matrix  and  R,t  describe  the  extrinsic  coordinate  transformation  from  scene  to  camera 
centered  coordinates  [64].  In  practice  we  must  also  account  for  the  lens  distortion,  which  further 
maps  m  to  m'  [65]. 

Pixel 


2.4.1  Pairwise  Feature-Based  Image  Registration 

Calibrated  Camera  Model 


Within  our  feature-based  framework,  we  assume  a  standard  calibrated  pin-hole  camera 
model  [64]  as  illustrated  in  Fig.  2-5.  This  means  that  the  homogeneous  mapping  from 
world  to  image  plane  can  be  described  by  a  3  x  4  projection  matrix  P  defined  as 

P  =  K[^R|ctCU)] 


where  f„R  and  ct. 


coordinate  frame,  c,  and  K  = 


cw  encode  the  coordinate  transformation  from  world,  w,  to  camera  centered 

Qu  S  U0 1 

is  the  known  3x3  upper  triangular  intrinsic 


0  av  v0 
L  0  0  1  J 

camera  calibration  matrix  with  au  and  av  the  pixel  focal  lengths  in  the  x  and  y  directions 
respectively,  (u0,v0)  is  the  principle  point  measured  in  pixels,  and  s  is  the  pixel  skew. 

Under  this  representation  the  interest  point  with  pixel  coordinates  (u,v)  in  image  /  is 
imaged  as 

u  =  PX  (2.9) 


where  u  =  [ti,u]T  is  the  vector  description  of  (u,  v),  u  =  [uT,  1]T  its  normalized  homoge¬ 
neous  representation,  X  =  [X,Y,Z\r  is  the  imaged  3D  scene  point,  and  X  =  [X  ,  1] 1  its 
normalized  homogeneous  representation.  Note  that  for  all  homogeneous  quantities,  equal¬ 
ity  in  expressions  such  as  (2.9)  is  implicitly  defined  up  to  scale.  The  benefit  of  having  a 
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calibrated  camera  is  that  we  can  “undo”  the  projective  mapping  to  pixel  coordinates  in 
(2.9)  and  instead  work  with  normalized  image  plane  coordinates  as 

X  =  K-1u  =  [' R  |  Hctujx. 

The  implication  of  this  is  that  we  can  now  describe  the  epipolar  geometry  in  terms  of 
the  Essential  matrix  [64]  and  recover  the  5-DOF  camera  pose  from  correspondences.  For 
our  application,  we  obtain  the  intrinsic  calibration  matrix  K  by  calibrating  in  water  using 
Zhang’s  planar  method  [180]  and  employ  Heikkila’s  radial/tangential  distortion  model  [65] 
to  compensate  for  both  lens  and  index  of  refraction  effects. 

Geometric  Feature-Based  Algorithm 

Our  feature-based  registration  algorithm  generally  follows  a  state-of-the-art  geometrical 
computer  vision  approach  as  described  by  Hartley  and  Zisserman  [64]  and  Faugeras,  Luong, 
and  Papadopoulo  [40].  Figures  2-6  and  2-7  illustrate  the  overall  hierarchy  of  our  feature- 
based  algorithm  founded  on: 

•  Extract  a  combination  of  both  Harris  [63]  and  SIFT  [93]  interest  points  from  each 
image.  For  the  Harris  points,  we  exploit  our  navigation  prior  to  apply  an  orientation 
normalization  to  the  interest  regions  by  warping  via  the  infinite  homography  [64],  Hqq, 
and  then  compactly  encode  using  Zernike  moments  [124]. 

•  Establish  putative  correspondences  between  overlapping  candidate  image  pairs  based 
upon  similarity  and  a  pose-constrained  correspondence  search  [36]. 

•  Employ  a  statistically  robust  least  median  of  squares  (LMedS)  [134]  registration  method¬ 
ology  with  regularized  sampling  [179]  to  extract  a  consistent  inlier  correspondence  set. 
For  this  task  we  use  a  6-point  Essential  matrix  algorithm  [125]  as  the  motion- model 
constraint. 

•  Solve  for  a  relative-pose  estimate  using  the  inlier  set  and  Horn’s  relative  orientation 
algorithm  [68]  initialized  with  samples  from  our  orientation  prior  (see  §2.4.2). 

•  Carry  out  a  two-view  MLE  refinement  to  extract  the  5-DOF  relative-pose  constraint 
(i.e.,  azimuth,  elevation,  Euler  roll,  Euler  pitch,  Euler  yaw)  and  first-order  parameter 
covariance  based  upon  minimizing  the  reprojection  error  over  all  inliers  [64]. 

The  remainder  of  this  section  focuses  on  the  more  novel  aspects  of  the  above  approach. 
In  particular,  we  discuss  how  to  exploit  sensor  measured  absolute  orientation  within  a 
feature-based  framework  and  in  addition  how  we  can  use  our  state  estimate  to  constrain 
correspondence  searches. 

2.4.2  Exploiting  Sensor-Measured  Absolute  Orientation 
Infinite  Homography  View  Normalization 

Establishing  feature  correspondences  is  arguably  the  most  difficult  task  in  a  feature-based 
registration  approach  —  this  is  especially  true  for  wide-baseline  registration.  Without 
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Figure  2-6  An  overview  of  the  pairwise  image  registration  engine.  Dashed  lines  represent  additional 
information  provided  by  our  state  estimate,  while  bold  boxes  represent  our  systems-level  extensions 
to  a  typical  feature-based  registration  framework.  Given  two  images  and  Ij ,  we  detect  features 
using  a  combination  of  Harris  and  SIFT  interest  operators.  For  the  Harris  points,  we  exploit  our  navi¬ 
gation  prior  to  orientation  normalize  the  interest  regions  by  warping  via  the  infinite  homography 
For  each  feature,  we  establish  a  putative  match  based  upon  similarity  and  a  novel  pose-constrained 
correspondence  search.  A  6-point  essential  matrix  algorithm  employed  within  a  statistically  robust 
LMedS  strategy  extracts  an  inlier  correspondence  set.  Using  this  set  we  initialize  our  relative-pose 
estimate  using  Horn’s  relative  orientation  algorithm  with  regularized  sampling  from  our  orientation 
prior  and  then  refine  in  a  two- view  bundle  adjustment  step  based  upon  minimizing  the  reprojection 
error  over  all  inliers. 
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Figure  2-7  Typical  output  from  the  pairwise  feature-based  image  registration  module  for  a  temporal 

pair  of  underwater  images.0  The  pose  and  triangulated  3D  feature  points  are  the  final  product  of 
a  two- view  MLE  bundle  adjustment  step.  The  3D  triangulated  feature  points  have  been  gridded 
in  Matlab  to  give  a  coarse  surface  approximation  that  has  then  been  texture  mapped  with  the 
common  image  overlap  (the  baseline  magnitude  is  set  to  the  navigation  prior  for  visualization). 


(a)  Harris  interest  points. 


(d)  MLE  epipolar  geometry. 


°To  aid  visualization,  the  images  have  been 


2.5  - 
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(e)  MLE  relative-pose  and  texture  mapped  scene, 
color  corrected  using  the  algorithm  described  in  [20]. 


(b)  SIFT  feature  points. 


0.5  v 


(c)  Inlier  correspondences. 
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any  knowledge  of  extrinsic  camera  information,  robust  techniques  must  rely  upon  encod¬ 
ing  features  in  a  viewpoint  invariant  way.  For  example,  rotational  and  scale  differences 
between  images  render  simple  correlation  based  similarity  metrics  useless.  Therefore,  to 
overcome  these  limitations,  advanced  techniques  generally  rely  upon  encoding  some  form 
of  locally  invariant  feature  descriptor  such  as  differential  invariants  [140],  generalized  image 
moments  [3,77,124],  and  affine  invariant  regions  [106,139,164].  However,  these  higher-order 
descriptions  also  tend  to  be  computationally  expensive. 

In  the  case  of  an  instrumented  platform  with  absolute  measurements  of  orientation, 
we  can  utilize  sensor-derived  information  to  our  advantage  to  relax  the  demands  of  the 
feature  encoding  while  at  the  same  time  making  it  a  more  discriminatory  metric.  For 
example,  in  our  application  we  use  sensor-derived  absolute  pose  information  to  prewarp  the 
feature  regions  around  our  Harris  interest  points  and  precompensate  for  camera  viewpoint 
orientation.  Since  attitude  sensors  provide  information  on  the  3D  orientation  of  cameras  c 
and  o'  in  a  fixed  reference  frame  w,  we  can  normalize  for  orientation  viewpoint  effects  via 
the  infinite  homography: 

Hqo  =  K^RK-1. 

The  infinite  homography  warps  an  image  taken  from  camera  orientation  c  into  an  image 
taken  from  orientation  d  (note  that  the  center  of  projection  still  remains  at  c).  This 
viewpoint  mapping  is  exact  for  points  at  infinity  where  X  =  [X,  Y,  0, 1],  but  otherwise  can 
be  used  to  compensate  for  viewpoint  orientation  (note  that  scene  parallax  is  still  present). 

We  compute  Hqo  based  upon  our  attitude  estimate  at  image  capture  and  apply  it  as  an 
orientation  correction  to  our  images  when  encoding  the  Harris  features.  As  demonstrated  in 
Fig.  2-8,  this  warp  effectively  yields  a  synthetic  view  of  the  scene  from  a  canonical  camera 
coordinate  frame  aligned  North,  East,  Down.  This  allows  normalized  correlation  to  be  used 
as  a  similarity  metric  between  Harris  points  and  tends  to  work  well  for  temporal  image 
sequences  by  generating  a  high  density  of  matches.  This  scheme  in  concert  with  SIFT 
features  has  proven  to  be  successful  for  obtaining  robust  similarity  matches. 


Horn’s  Relative  Orientation  Algorithm  and  Samples  from  our  Orientation  Prior 


We  can  also  take  advantage  of  our  absolute  orientation  prior  by  obtaining  an  initial  relative- 
pose  solution  using  Horn’s  algorithm  [68].  Given  a  set  of  inlier  feature  correspondences  and 
an  initial  orientation  guess,  Horn’s  algorithm  iteratively  calculates  a  relative-pose  estimate 
based  upon  enforcing  the  co-planarity  condition  over  all  ray  pairs  (i.e.,  if  a  ray  from  the 
left  and  right  camera  are  to  intersect  then  they  must  lie  in  a  plane  that  also  contains  the 
baseline).  If  the  orientation  guess  is  approximately  close  to  the  true  orientation,  Horn’s 
algorithm  quickly  converges  to  a  minimal  co-planarity  error  solution.  Since  orientation  can 
be  measured  with  bounded  precision  over  the  entire  survey  site  while  the  camera  baseline 
cannot,  we  use  Horn’s  algorithm  to  obtain  our  initial  5-DOF  relative-pose  solution,  which  is 
then  refined  in  a  two- view  bundle  adjustment  step  based  upon  minimizing  the  reprojection 
error  [64]. 
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Figure  2-8  Synthetically  normalizing  the  camera  orientation  via  the  infinite  homography. 


(a)  Original  distortion  compensated  coral  image. 


(b)  Normalized  image  using  H  oo  warp. 


2.4.3  Pose-Constrained  Correspondence  Search 

As  mentioned  in  §2.4.2,  the  problem  of  initial  feature  correspondence  establishment  is  ar¬ 
guably  the  most  difficult  and  challenging  task  of  a  feature-based  registration  methodology. 
As  we  show  in  this  section,  having  a  pose  prior  relaxes  the  demands  on  the  complexity  of 
the  feature  descriptor  —  instead  of  having  to  be  globally  unique  within  an  image,  it  now  is 
required  to  be  only  locally  unique.  We  use  the  epipolar  geometry  constraint  expressed  as 
a  two- view  point  transfer  model  to  restrict  the  correspondence  search  to  probable  regions. 
These  regions  are  determined  by  our  pose  prior  and  are  used  to  confine  the  interest  point 
matching  to  a  small  subset  of  candidate  correspondences.  The  benefit  of  this  approach  is 
that  it  simultaneously  relaxes  the  demands  of  the  feature  descriptor  while  at  the  same  time 
improves  the  robustness  of  similarity  matching. 


Epipolar  Uncertainty  Representation 

Zhang  [179]  first  characterized  epipolar  geometry  uncertainty  in  terms  of  the  covariance 
of  the  fundamental  matrix  while  Shen  [142]  used  knowledge  of  the  pose  prior  to  restrict 
the  correspondence  search  to  bands  along  the  epipolar  line  calculated  by  propagating  pose 
uncertainty.  However,  a  criticism  of  both  of  these  characterizations  is  that  the  uncertainty 
representation  is  hard  to  interpret  in  terms  of  physical  parameters  —  how  does  one  interpret 
the  covariance  of  a  line?  Our  approach  is  to  use  a  two-view  point  transfer  mapping  that 
benefits  from  a  direct  physical  interpretation  of  the  pose  parameters  and  in  addition  can 
take  advantage  of  scene  range  data  if  available.  While  similar  to  Lanser’s  technique  [83], 
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our  approach  does  not  assume  or  require  that  an  a  priori  CAD  model  of  the  environment 
exist. 


Two-View  Point  Transfer  Model 

In  deriving  the  point  transfer  mapping  we  assume  projective  camera  matrices  P  =  K[1 1 0] 
and  P'  =  K[R  1 1]  where  for  notational  convenience  we  drop  the  explicit  subscript/superscript 
notation  and  simply  write  the  relative  orientation  parameters  as  R,t  (i.e.,  R  =  ^R  and  ^t ,jc). 
We  begin  by  noting  that  the  scene  point  X  is  projected  through  camera  P  as 


u  =  PX  =  KX, 


which  implies  that  explicitly  accounting  for  scale  we  have 

X  =  ZK_1u.  (2.10) 

The  back-projected  scene  point  X  can  subsequently  be  reprojected  into  image  I'  as 

u'  =  P'X  =  K(RX  +  t).  (2.11) 


By  substituting  (2.10)  into  (2.11)  and  recognizing  that  the  following  relation  is  up  to  scale, 
we  obtain  the  homogeneous  point  transfer  mapping  [64]: 

u'  =  KRK_1u  +  Kt/Z.  (2.12) 


Finally,  by  explicitly  normalizing  (2.12)  we  recover  the 
mapping 

,  _  HooU  +  Kt /Z 

U  =  H^u  +  tjz 


non-homogeneous  point  transfer 


(2.13) 


where  Hqo  =  KRK-1,  refers  to  the  third  row  of  Hqq,  and  tz  is  the  third  element  of  t. 

When  the  depth  of  the  scene  point  Z  is  known  in  camera  frame  c,  then  (2.13)  describes 
the  exact  two- view  point  transfer  mapping.  However,  when  Z  is  unknown,  then  (2.13) 
describes  a  functional  relationship  on  Z  (i.e.,  u'  =  /( u,  Z))  that  traces  out  the  corresponding 
epipolar  line  in 


Proof.  Note  that  if  u'  lies  on  the  epipolar  line  1'  then  it  must  satisfy 

u,Tl'  =  0.  (2.14) 

Thus,  if  we  can  show  that  for  all  values  of  Z,  (2.12)  satisfies  (2.14),  then  we  can  deduce 
(2.13)  parameterizes  the  epipolar  line  in  V  as  a  function  of  Z.3 

u,Tl'  =  0 

u/TFu  =  0  V  =  Fu,  where  F  is  the  fundamental  matrix 
u'T  K‘t  [t]  x  RK-1  u  =  0  expanding  F 

V - v - ' 

F 
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(KRK  1u  +  Kt /Z)  K  T[t]xRK  1u  =  0  substituting  (2.12) 

' - v - ' 

a' 

(Ka  +  Kt/Z)TK~T[t]xa  =  0  defining  a  =  RK-1u 
(aTKT  +  tTKT/Z)K_T[t]xa  =  0 
(aT  +  tT/Z)  [t]  x  a  =  0 
o  ■  (t  x  Of)  4-  -gt  •  (t  x  ot)  —  0 

0  =  0  V  values  Z  Q.E.D. 


Point  Transfer  Mapping  with  Uncertainty 

Now  that  we’ve  derived  the  two- view  point  transfer  mapping  (2.13),  in  this  section  we  show 
how  we  can  use  it  to  constrain  our  correspondence  search  between  image  pair  /*,  Ij  by  using 
our  pose  prior  knowledge  from  £t.  We  begin  by  defining  the  parameter  vector  7  as 

7  =  [xj ,  xj. ,  Z,  u,  v]T  (2.15) 

with  mean  /i7  and  covariance  E7  given  by 
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Here,  xPi,  xPj  are  the  delayed-state  vehicle  poses  from  £t  used  to  calculate  the  relative 
camera  pose  xc>Ci  (2.6),  Z  and  oz  represent  the  scene  depth  parameters  as  measured  in 

camera  frame  z,  and  ( u,v )  describe  the  feature  location  in  pixels  in  image  Ii.  Note  that  in 

defining  E7  we  employ  the  common  assumption  that  features  are  extracted  with  isotropic, 
independent,  unit  variance  noise  [64],  To  obtain  a  first-order  estimate  of  the  uncertainty  in 
the  point  transfer  mapping  between  and  Ij  we  compute 

(213)1^  (2.16) 

£„/  «  JE7Jt  (2.17) 

where  /zu,  is  the  predicted  point  location  of  u  in  Ij,  £„>  its  variance,  and  J  =  is  the 
point  transfer  Jacobian.3 4  We  use  the  Gaussian  distribution  as  an  analytical  tool  to  compute 
first-order  search  bounds  in  (u\vf)  space  by  noting  that 

(u'  -  /ftt,)TE^1(u/  -  /v)  =  k2  (2.18) 

3[a]x  denotes  a  skew  symmetric  matrix  implementing  the  vector  cross-product  (i.e.,  [a]xb  =  axb). 

4 We  compute  this  Jacobian  numerically  using  the  algorithm  described  in  [64,  §A4.2]. 
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defines  an  ellipse  where  k2  follows  a  x!  distribution.  Hence,  we  can  choose  an  appropriate 
k2  such  that  with  probability  a  the  true  mapping  u„  falls  within  this  region.  Under  this 
scheme  we  test  all  feature  points  in  Ij  to  see  if  they  fall  within  the  ellipse  (2.18),  and  if  they 
do,  then  they  are  considered  to  be  candidate  matches  for  u.  Since  relative-pose  uncertainty 
depends  on  the  reference  frame  in  which  it  is  expressed,  we  apply  the  two- view  search 
constraint  both  forwards  and  backwards  to  obtain  a  consistent  candidate  correspondence 
set.  In  other  words,  candidate  matches  in  Ij  that  correspond  to  interest  points  in  /,  are 
checked  to  see  if  they  map  back  to  the  generating  interest  point  in  I{.  Based  upon  this  set 
of  consistent  candidate  matches,  feature  similarity  is  then  used  to  establish  the  one-to-one 
putative  correspondence  set. 

Algorithm  2.2  describes  the  pose-constrained  correspondence  search  in  pseudo-code  where 
we  use  scene  depth,  Z ,  and  its  uncertainty,  oz,  as  a  convenient  parameterization  for  con¬ 
trolling  the  size  and  shape  of  the  search  regions  in  Ij  as  illustrated  in  Fig.  2-9.  For  example, 
in  the  case  where  no  a  priori  knowledge  of  scene  depth  is  available,  choosing  any  finite 
value  for  Z  and  setting  oz  —*  oo  recovers  a  search  band  along  the  epipolar  line  in  Ij  whose 
width  corresponds  to  the  uncertainty  in  relative  camera  pose,  xCiCj  (Fig.  2-9(a)).  On  the 
other  hand,  when  knowledge  of  an  average  scene  depth,  Zavg,  exists  (e.g.,  from  an  altimeter) 
(Fig.  2-9(b)),  then  it  and  an  appropriately  chosen  oz  can  be  used  to  limit  the  search  space 
to  ellipses  centered  along  the  epipolar  lines  (Fig.  2-9(c)).  Furthermore,  in  the  case  where 
dense  scene  range  measurements  are  available  (e.g.,  from  a  laser  range  finder  or  scanning 
pencil-beam  sonar),  then  scene  depth,  Z,  can  be  assigned  on  a  point- by-point  basis  with 
high  precision.  In  any  case,  the  pose-constrained  correspondence  search  greatly  improves 
the  reliability  and  robustness  of  feature  similarity  matching  by  reducing  the  candidate  cor¬ 
respondence  set  to  a  relatively  few  number  of  options  as  demonstrated  in  Fig.  2-10. 

Algorithm  2.2  Pose-constrained  correspondence  search. 

Require;  Uj,  Uj ,  t ^PiPi>  ^PiPj>  ^PjPj>  ^ 1  ®z 

{Ui,  Uj  are  the  set  of  feature  points  in  /»,  Ij  respectively} 

1:  Cij  <—  0 UixUj  {initialize  feature  correspondence  matrix  ij  to  all  zeros} 

2:  for  all  Uj  6  U{  do  {forward  mapping  from  Ii  to  Ij} 

3:  assemble  7  as  in  (2.15) 

4:  do  point  transfer  nu<  <—  (2.16) £„/  <—  (2.17)|E^ 

5:  for  all  Uj  €  Uj  do  {test  all  points  in  Ij} 

6:  if  (uj  -  /iu/)TE“,1(uj  —  nu')  <  k2  then  {uj  lies  in  the  ellipse} 

7:  Cij(uj,Uj)  «—  1  {flag  Uj,  Uj  as  candidate  match} 

8;  end  if 
9;  end  for 
10:  end  for 

11:  repeat  the  above  with  the  roles  of  Ui,  Uj  reversed  to  get  Cji 

12:  C  <—  Cij  n  Cji  { forwards / backwards  intersection  yields  a  consistent  candidate  set} 

13:  assign  putative  matches  based  upon  similarity  measure  within  the  restricted  set  C 
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Figure  2-10  The  pose-constrained  candidate  correspondence  matrix  for  Fig.  2-9(c).  The 
rows/columns  correspond  to  a  ordering  of  the  feature  indices  in  U/Ij  respectively,  where  a  nonzero 
entry  indicates  a  potential  match.  Note  that  without  any  a  priori  pose  knowledge  this  matrix  would 
be  full.  Hence,  we  would  be  forced  to  rely  purely  upon  the  discriminatory  power  of  the  feature  simi¬ 
larity  measure  to  establish  correspondences.  Below,  we  see  that  the  pose-restricted  search  constraint 
has  reduced  the  possible  space  of  matches  by  over  97%. 


cmatrix  3.1 7%  non-zero 


2.4.4  Are  Pairwise  Camera  Measurements  Correlated? 

We  now  address  the  question  of  whether  or  not  pairwise  camera  measurements  are  corre¬ 
lated.  Recall  that  a  primary  assumption  in  an  EKF  fusion  framework  is  that  measurements 
are  assumed  to  be  corrupted  by  time  independent  noise  (2.1). 5  In  our  view-based  frame¬ 
work,  raw  images  are  registered  directly  to  produce  a  relative-pose  measurement  that  is 
then  fed  to  the  EKF  filter  as  a  relative  observation  between  two  states.  If  a  raw  image 
is  used  multiple  times  to  make  multiple  measurements,  for  example  /*  «-►  Ij  and  Ij  <-►  /*, 
then  this  raises  the  possibility  that  camera  measurements  z ^  and  Zjk  may  be  correlated. 
Neglecting  such  a  correlation  would  put  too  much  weight  on  the  measurement  update  step 
since  it  would  treat  each  observation  as  an  independent  corroboration  of  the  other.  Unfortu¬ 
nately,  as  with  any  view-based  “scan- matching”  framework  that  “reuses”  raw  data,  actually 
computing  the  measurement  correlation  is  intractable  and,  thus,  out  of  practicality  mea¬ 
surements  are  assumed  independent  [61,94].  However,  in  the  case  of  measurements  made 
from  low-overlap  imagery,  we  argue  that  this  independence  assumption  is  not  particularly 
far  from  the  truth. 

Our  camera-derived  relative-pose  measurement  and  covariance  are  generated  as  an  end- 
product  of  a  feature-based  two-view  maximum  likelihood  estimate  based  upon  minimizing 

5 A  typical  strategy  for  dealing  with  time-correlated  measurements  is  to  augment  the  state  description 
with  an  appropriately  chosen  linear  system  driven  by  white  noise.  The  output  of  this  “coloring”  process 
is  then  added  to  an  otherwise  noiseless  observation  model  to  account  for  the  time-correlated  measurement 
noise  [54]. 
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the  reprojection  error.  As  is  common  in  the  vision  community,  the  image  feature  locations 
are  assumed  to  be  corrupted  by  independent  isotropic  noise  of  unit  variance  [64]  (measured 
in  pixels).  Denoting  the  set  of  common  features  between  <->  Ij  as  and  the  set  between 
Ij  «-»  Ik  as  Fjki  the  implication  of  this  noise  assumption  is  that  for  null  pairwise  feature 
intersection  (i-e.,  Fa  n  Fjk  =  0),  the  corresponding  camera  measurements  z y  and  Zj*  are 
uncorrelated.  Hence,  assuming  pairwise  independence  is  true  for  temporally  consecutive 
images  with  50%  or  less  overlap,  and  approximately  true  for  spatial  measurements  where 
the  number  of  re-observed  point  correspondences  is  low.6 


Proof.  Assume  three  camera  frames  C\,  C2,  C3  where  image  pairs  {I\,h)  and  {h.h)  have 
spatial  overlap,  but  (Ji,  I 3)  do  not.  Define  C2  to  be  the  reference  coordinate  frame  and  X 
to  be  the  set  of  3D  world  points  viewed  by  the  three  camera  system  (i.e.,  X  =  {X12,  X23} 
where  X,j  is  the  set  of  3D  points  viewed  by  camera  Ci  and  Cj  and  X13  =  {0}). 

For  the  two-view  MLE  bundle  adjustment  we  define  our  parameter  vector  q  to  be 

q=  [P21.  P23.  xT]T 

where  p21  and  p23  represent  the  pose  vectors  of  cameras  C\  and  C3  with  respect  to  camera 
frame  C2  respectively.  The  reprojection  error  is  defined  as 


ei4  =  ui4  -  P(K,p21,X12j)  Camera  1 

e2 i  =  U24  -  P(K,06xi,Xi)  Camera  2 

e3<  =  u3,  -  P(K,  P23.  x23, )  Camera  3 


where  urii  is  the  ilh  feature  point  measured  in  image  In,  K  is  the  camera  calibration  matrix, 
Xj  is  the  ilh  3D  point,  and  P(K,  pnm,  Xnm< )  denotes  the  pinhole  projection  of  the  3D  point 
into  the  image  plane.  Stacking  all  of  the  reprojection  error  equations  we  have 


£i 


^total  — 


e  2 


.S3. 


where  £1,  £2,  £3  are  the  individual  image  error  measurements. 


T  iT 


ei  =  K,---  ,elm] 


£2  =  [e2l ,  •  •  •  ,  e2m+n]T 


T  iT 


£3  =  [e3l,--  -  ,e3J 


The  goal  of  bundle  adjustment  is  to  minimize  the  total  squared  error  cost 

C  total  ^  total 


over  all  views  by  optimizing  over  camera  poses  and  scene  structure  and  is  considered  to 
be  the  gold-standard  by  which  all  other  measures  are  judged.  This  nonlinear  optimization 
problem  is  solved  from  an  initial  guess  via  a  large-scale,  sparse,  partitioned  Levenberg- 
Marquardt  algorithm  [64,128]  that  takes  advantage  of  the  sparse  reprojection  error  Jacobian, 

6  Note  that  both  of  these  criteria  are  usually  met  in  a  typical  grid- based  AUV  survey  where  both  cross- track 
and  temporal  overlap  are  of  the  order  of  20-35%. 
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which  for  the  problem  under  consideration  is 


t  _  ^fetotal 

~~di~ 


total  total  total  total 

&P21  '  dP23  ’  5X12  ’  9X23 


Z&i  0mx6  °mxn 

0(m+n)x6  0(m+n)x6 

°nx6 


The  MLE  estimate  q  corresponds  to  the  local  minima  of  C  with  a  first-order  estimate  of  its 
parameter  covariance  given  by  [64] 

E,  =  JTJ 


de\  T  de 1 
5p2i  ^p2i 

06x6 

dei  T  de\ 

3p21  ax2\ 

06X771 

06x6 

aea_T  aea. 
tfP23  C,P23 

06xm 

dei  T  de 1 

5p23  ^X23 

41 

4! 

0mx6 

de\  de\  ,  de 2  de2 

5Xi2  3Xi2  5Xi2  0Xi  2 

de2  de2 

5Xi2  ax23 

_  06x6 

d£3  T  de.3 
5X23  <9P23 

d£2_ 

5X23  3Xi2 

de2  T  de2  ,  des  T  de 3 

ax23  5X23  dX23  ax23  . 

Note  that  in  the  above,  we’ve  employed  the  common  assumption  of  isotropic,  independent, 
unit  variance  feature  pixel  noise.  To  make  interpretation  easier,  we’ve  partitioned  the 
parameter  covariance  2,  into  camera  poses  (upper-left)  and  3D  structure  (lower-right). 
Careful  inspection  of  E,  clearly  shows  that  for  null  feature  intersection,  the  two  relative 
camera  poses  p21  and  p23  are  uncorrelated.7  Q.E.D. 


2.5  Results 

In  this  section  we  present  results  demonstrating  VAN’s  application  to  underwater  pose 
estimation.  The  first  set  of  results  is  for  a  real-world  dataset  collected  by  the  SeaBED  AUV 
during  a  benthic  habitat  classification  survey  conducted  at  the  Stellwagen  Bank  National 
Marine  Sanctuary.  The  second  set  of  results  are  for  experimental  validation  of  the  VAN 
framework  using  a  ROV  at  the  Johns  Hopkins  University  (JHU)  Hydrodynamic  Test  Facility 
with  ground-truth. 

2.5.1  Real-World  Results:  Stellwagen  Bank 
Experimental  Setup 

The  SeaBED  AUV  conducted  a  grid-based  survey  for  a  portion  of  the  Stellwagen  Bank 
National  Marine  Sanctuary  in  March  2003  [145].  The  vehicle  is  equipped  with  a  single 
down-looking  camera  and  is  instrumented  with  the  navigation  sensor  suite  tabulated  in 
Table  2.1,  pg.  50.  As  depicted  in  Fig.  2-11,  SeaBED  conducted  the  survey  in  a  bottom¬ 
following  mode  where  it  tried  to  maintain  constant  altitude  over  a  sloping  rocky  ocean 
bottom.  The  intended  survey  pattern  consisted  of  15  North/South  legs  each  180  meters 
long  and  spaced  1.5  meters  apart  while  maintaining  an  average  altitude  of  3.0  meters  above 

7A1so  note  that  the  3D  structure  X12  and  X23  are  uncorrelated  as  well  since  1  =  0mxn 
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Figure  2-11  A  depiction  of  Stellwagen  Bank’s  topography.  Since  the  vehicle  was  trying  to  maintain 
constant-altitude  the  depth  plot  (a)  is  a  proxy  for  terrain  variation  —  notice  that  the  depth  excursions 
are  on  the  order  of  several  meters. 


the  seafloor  at  a  forward  velocity  of  0.35  m/s.  Closed-loop  feedback  on  the  DR  navigation 
estimate  was  used  for  real-time  vehicle  control. 

We  processed  a  small  subset  of  the  dataset  using  100  images  from  a  South/North  track¬ 
line  pair,  the  results  of  which  are  shown  in  Fig.  2-13.  Plot  (b)  on  the  right  depicts  the  VAN 
estimated  camera  trajectory  and  its  3-sigma  confidence  bounds.  Successfully  registered  im¬ 
age  pairs  are  indicated  by  the  red  and  green  links  connecting  the  camera  poses  where  green 
corresponds  to  temporally  consecutive  image  frames  and  red  to  spatially  neighboring  image 
frames.  For  comparison  purposes,  plot  (a)  on  the  left  depicts  the  DR  trajectory  overlaid  on 
top  of  the  VAN  estimated  XY  trajectory.  Both  plots  are  in  meters  where  x  is  East  and  Y  is 
North. 

Our  feature-based  registration  algorithm  was  successful  in  automatically  establishing 
putative  correspondences  between  sequential  image  pairs  (green  links),  however,  automatic 
cross-track  image  registration  (red  links)  proved  to  be  too  difficult.  The  cause  for  this 
was  due  to  significant  variation  in  scene  appearance  as  illuminated  from  different  vantage 
points.8  Furthermore,  these  variations  were  complicated  by  the  fact  that  the  vehicle  flew 
on  reciprocal  headings  alternating  every  other  leg  of  the  grid-based  survey.  Hence,  shadows 
were  cast  in  opposite  directions  for  parallel  tracklines.  Therefore,  for  this  dataset  cross-track 
putative  correspondences  were  manually  established  for  19  image  pairs,  which  are  indicated 
by  the  red  spatial  links  in  Fig.  2-13. 


8The  example  shown  in  Fig.  l-13(b),  pg.  36  displays  cross-track  imagery  from  this  dataset. 
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Experimental  Results 

A  number  of  important  observations  in  Fig.  2-13  are  worth  pointing  out. 

1.  First,  note  that  the  VAN  uncertainty  ellipses  are  smaller  for  camera  poses  that  are 
constrained  by  spatial  links.  Since  spatial  links  provide  a  mechanism  for  relating 
past  vehicle  poses  to  the  present,  they  also  provide  a  means  for  correcting  DR  drift 
error.  While  trajectory  uncertainty  in  a  DR  navigation  system  grows  monotonically 
unbounded  with  time,  in  contrast  VAN’s  error  growth  is  essentially  a  function  of 
network  topology  and  distance  away  from  the  zeroth  reference  node  (i.e.,  the  first 
image). 

2.  A  second  observation  worth  noting  is  the  delayed-state  smoothing  that  occurs  in  the 
VAN  framework.  Spatial  links  not  only  decrease  the  uncertainty  of  the  image  pair 
involved,  but  also  decrease  the  uncertainty  of  other  delayed-states  that  are  correlated. 
In  particular,  Fig.  2-12  characterizes  the  time-evolution  of  the  view-based  map  uncer¬ 
tainty  by  plotting  the  trace  of  the  XY  sub-block  versus  image  frame  number.  Note  the 
sudden  decrease  occurring  at  image  frame  754  —  this  event  coincides  with  the  first 
cross-track  link.  Information  from  that  spatial  measurement  is  propagated  along  the 
network  topology  to  other  vehicle  poses  via  the  shared  correlations  in  the  covariance 
matrix. 

3.  Thirdly,  referring  back  to  Fig.  2-13,  note  that  a  temporal  (green)  link  does  not  exist 
between  consecutive  image  frames  near  XY  location  (—4,0).  A  break  like  this  in  the 
temporal  image  chain  prevents  concatenation  of  the  relative  camera  measurements  and 
in  a  purely  vision-only  approach  could  cause  algorithms  that  depend  on  a  connected 
topology  to  fail.  It  is  a  testament  to  the  robustness  of  VAN  that  a  disconnected 
camera  topology  does  not  present  any  significant  issue.  The  key  to  its  success  is 
that  navigation  allows  correlation  to  be  built  between  temporal  poses  even  though  a 
camera  measurement  may  not  exist. 

4.  Finally,  an  additional  point  worth  mentioning  is  that  VAN  results  in  a  self-consistent 
estimate  of  the  vehicle’s  trajectory.  Referring  to  Fig.  2-14,  initial  processing  of  the  im¬ 
age  sequence  resulted  in  a  VAN  trajectory  estimate  that  did  not  lie  within  the  3-sigma 
confidence  bounds  predicted  by  DR.  In  particular,  VAN  recovered  a  crossing  trajec¬ 
tory  similar  to  Fig.  2-13  while  the  DR  estimate  consisted  of  two  parallel  South/North 
tracklines.  Upon  further  investigation  it  became  clear  that  the  cause  of  this  discrep¬ 
ancy  was  a  significant  nonlinear  heading  bias  in  the  magnetic  flux  gate  compass.  We 
used  an  independently  collected  dataset  to  calculate  a  compass  bias  correction  and 
then  applied  it  to  our  heading  data  to  produce  the  results  shown  in  Fig.  2-13  where 
DR  and  VAN  are  now  in  agreement.  Essentially,  VAN  camera-derived  measurements 
had  been  good  enough  to  compensate  for  the  large  heading  bias  and  still  recover  a 
consistent  vehicle  trajectory  despite  the  unmodeled  compass  error  (recall  that  in  a 
Kalman  update  the  prior  will  essentially  be  ignored  if  the  measurements  are  very 
certain). 
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Figure  2-12  Time-evolution  of  the  Stellwagen  Bank  pose  network  uncertainty.  For  each  delayed- 
state  entry  in  (i.e.,  for  each  vehicle  pose  x*),  the  trace  of  its  2  x  2  XY  covariance  matrix  is  plotted 
versus  image  frame  number.  The  colored  dots  depict  the  pose  uncertainty  at  image  insertion  and  the 
lines  show  their  time  evolution  for  every  5th  delayed-state.  A  couple  key  events  are  worth  pointing 
out.  First,  notice  the  monotonically  increasing  uncertainty  in  XY  position  between  frames  700-753. 
This  period  corresponds  to  when  only  temporally  consecutive  image  frame  measurements  could  be 
made.  Second,  notice  the  regional  smoothing  and  sharp  decrease  in  uncertainty  for  correlated  state 
poses  at  frame  number  754.  Frame  754  corresponds  to  the  first  cross-track  spatial  measurement 
made  by  the  camera.  Finally,  note  that  the  uncertainty  in  XY  pose  continues  to  decrease  from  frame 
number  754-763  as  more  cross-track  measurements  are  made.  From  frame  764  onward,  uncertainty 
begins  to  increase  again  as  no  more  spatial  measurements  can  be  made.  Shown  below  the  covariance 
plot  is  a  bar  graph  of  the  number  of  successfully  registered  image  pairs  for  each  frame  number. 
Temporally  consecutive  camera  measurements  are  shown  in  green,  and  the  number  of  spatial  cross- 
traek  measurements  are  in  red.  Notice  that  the  decrease  in  XY  uncertainty  in  the  covariance  plot 
coincides  with  the  first  cross-track  measurement  made  by  the  camera  (frame  number  754). 
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Figure  2-14  VAN  recovers  a  camera-consistent  trajectory  estimate  despite  an  unmodeled  compass 
heading  bias.  The  DR  trajectory  (gray)  does  not  lie  within  the  3-sigma  VAN  estimate  (blue).  This 
discrepancy  comes  from  an  unmodeled  compass  bias,  which  when  accounted  for,  produces  the  results 
shown  in  Fig.  2-13.  It  is  a  testament  to  the  robustness  of  VAN  that  camera-derived  relative-pose 
measurements  forced  the  recovery  of  a  consistent  cross-over  trajectory  despite  having  an  unmodeled 
heading  bias. 


North (m] 


2.5.2  Experimental  Validation:  JHU  Test  Tank 

To  validate  the  error  characteristics  of  VAN  as  compared  to  traditional  DR,  we  collaborated 
with  our  colleagues  at  JHU  to  collect  an  in-tank  ROV  dataset  with  ground  truth.9 

Experimental  Setup 

The  experimental  setup  consisted  of  a  single  downward-looking  digital  still  camera  mounted 
to  a  moving  underwater  pose  instrumented  ROV  at  the  JHU  Hydrodynamic  Test  Facility. 
Their  vehicle  is  instrumented  with  a  typical  suite  of  oceanographic  dead-reckoning  nav¬ 
igation  sensors  capable  of  measuring  heading,  attitude,  XYZ  bottom-referenced  Doppler 
velocities,  and  a  pressure  sensor  for  depth.  The  vehicle  and  test  facility  are  also  equipped 
with  a  high  frequency  acoustic  LBL  system,  which  provides  centimeter-level  bounded  error 
XY  vehicle  positions  used  for  validation  purposes  only.  A  simulated  seafloor  environment 
was  created  by  placing  textured  carpet,  riverbed  rocks,  and  landscaping  boulders  on  the 
tank  floor  and  was  appropriately  scaled  to  match  a  rugged  seafloor  environment  with  con¬ 
siderable  3D  scene  relief.  See  Fig.  2-15  for  depiction  of  the  experimental  setup. 

In  addition,  we  also  tested  an  innovative  dual- light/ camera  configuration  by  placing  fore 
and  aft  lights  on  the  ROV  with  the  camera  mounted  in  the  center  as  shown  in  Fig.  2-16. 
The  purpose  of  this  test  was  to  see  if  we  could  mitigate  the  viewpoint  variant  illumination 
effects  that  had  prevented  us  from  automatically  establishing  cross-track  correspondences  in 
the  Stellwagen  Bank  dataset.  As  the  results  in  this  section  show,  the  outcome  was  success¬ 
ful.  The  dual-light  configuration  alleviated  viewpoint  illumination  effects  by  improving  the 
signal-to-nose  ratio  in  shadowed  regions  so  that  fully  automatic  cross-track  correspondences 
were  achieved. 


9We  graciously  thank  our  hosts  Louis  Whitcomb  and  James  Kinsey  for  their  collaboration  in  collecting 
the  JHU  dataset. 
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Figure  2-15  The  JHU  experimental  setup.  Low-pile  carpet,  artificial  landscaping  boulders,  and 
riverbed  rock  were  all  placed  on  the  tank  floor  to  create  a  natural  looking  seafloor  with  extensive 
scene  relief  for  a  camera  altitude  of  1.5  m. 


(a)  JHU  ROV. 


(b)  Partial  view  of  the  artificial  scene  constructed  on  the  tank  floor. 


Figure  2-16  The  dual-light  setup  used  on  the  JHU  ROV.  This  experimental  dual-light  configuration 
with  the  camera  mounted  in  the  center  made  fully  automatic  image  registration  robust  to  the  effects 
of  viewpoint  variant  scene  illumination.  Traditional  single-light  configurations  (c)  cast  significant 
shadows  and  cause  objects  to  look  very  different  from  differing  vantage  points.  This  makes  automatic 
correspondence  establishment  very  difficult.  Meanwhile,  the  experimental  dual-light  configuration 
(b)  increases  image  illumination  invariance  by  casting  “double-shadows”  that  the  camera  can  “see 
through” . 
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Experimental  Results 

Fig.  2-17  shows  the  estimated  XY  trajectory  for  a  101  image  sequence  comprised  of  roughly 
25%  temporal  overlap.  For  this  experiment,  the  vehicle  started  near  the  top-left  corner  of 
the  plot  at  (-2.5,2.75)  and  then  drove  a  course  consisting  of  two  grid-based  surveys,  one 
oriented  SW  to  NE  and  the  other  W  to  E.  Both  plots  show  the  spatial  XY  pose  topology,  3- 
sigma  confidence  bounds,  and  network  of  camera  constraints  —  note  that  the  VAN  result  is 
end-to-end  fully  automatic.  Green  links  correspond  to  temporally  consecutive  images  that 
were  successfully  registered  while  red  links  correspond  to  spatially  registered  image  pairs 
—  in  all  there  are  307  camera  constraints  (81  temporal  /  226  spatial).  Notice  that  the  XY 
uncertainty  in  the  DR  estimate  grows  monotonically  with  time  while  in  the  VAN  estimate  it 
is  constrained  by  the  camera-link  topology.  Fig.  2-18  further  corroborates  this  observation 
and  in  particular  Fig.  2- 18(b)  shows  that  VAN  exhibits  a  linear  trend  in  error  growth  as  a 
function  of  distance  away  from  the  reference  node.  Note  that  the  spread  of  points  away  from 
this  linear  fit  is  due  to  inhomogeneity  in  the  number  of  edges  per  node  in  the  corresponding 
pose-constraint  network.  Nonetheless,  this  raises  the  interesting  engineering  question  of  how 
one  might  go  about  reducing  the  slope  of  the  linear  relationship  exhibited  in  Fig.  2-18(b)? 
From  a  camera  perspective,  design  criteria  that  could  help  improve  this  performance  are: 

•  Higher  resolution  images.  Increased  resolution  improves  both  the  accuracy  and  preci¬ 
sion  with  which  2D  feature  points  can  be  extracted  and  localized  within  the  viewable 
image  plane.  This  in  turn  improves  the  accuracy  and  precision  of  the  relative-pose 
camera  measurement. 

•  A  wider  FOV.  Increasing  the  camera’s  FOV  improves  the  pairwise  observability  of 
camera  motion  and,  hence,  the  overall  precision  of  the  camera-derived  relative-pose 
measurement.  However,  increasing  the  FOV  also  results  in  lower  image  resolution,  so 
a  good  balance  between  the  two  should  be  found. 

•  Characterize  feature  repeatability.  Recall  that  our  image  registration  module  employs 
the  common  assumption  that  features  are  extracted  with  independent,  isotropic,  unit 
variance  pixel  noise.  This  noise  model  does  not  have  any  real  physical  basis,  but  rather 
is  assumed  merely  for  convenience.  Hence,  it  would  be  worthwhile  to  setup  a  testbed  of 
seafloor  imagery  for  measuring  the  repeatability  of  our  image  feature  extractors  under 
different  viewing,  surface,  and  lighting  conditions.  This  would  provide  a  more  accurate 
characterization  of  the  feature  extraction  precision  and,  thus,  a  better  description  for 
the  overall  precision  of  our  relative-pose  camera  measurements. 

•  Better  camera  calibration.  Our  registration  framework  assumes  that  we  axe  using 
a  calibrated  camera,  which  implies  that  the  projective  mapping  from  Euclidean  ray 
space  to  image  pixel  space  is  known.  Therefore,  a  poor  calibration  could  introduce 
a  persistent  bias  into  our  camera-derived  relative-pose  measurements  and  effect  the 
overall  consistency  of  the  state  estimate.  Hence,  obtaining  a  good  calibration  is  im¬ 
portant. 
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Figure  2-18  Uncertainty  characteristics  of  VAN  versus  DR  for  the  JHU  tank  dataset  shown  in 
Fig.  2-17.  Plots  (a)  and  (b)  show  the  determinant  of  the  XY  sub-block  for  each  vehicle  pose  in  the 
view-based  map.  The  determinant  is  plotted  versus  both  path  length  and  Euclidean  distance  away 
from  the  first  image  in  the  view-based  map.  Notice  that  the  DR  uncertainty  is  clearly  a  monotonic 
function  of  path  length  whereas  VAN  uncertainty  is  related  to  the  distance  away  from  the  reference 
image.  Plots  (c)  and  (d)  show  the  same  trend  but  represented  in  a  different  way.  Here,  the  XY  pose 
determinant  for  each  view  is  plotted  in  a  scatter  plot  versus  both  reference  image  distance  and  path 
length  —  the  size  and  color  of  each  disk  is  proportional  to  the  determinant.  In  this  representation  we 
see  that  VAN  and  DR  have  orthogonal  uncertainty  characteristics  with  DR  growing  per  path  length 
(which  is  a  proxy  for  time)  and  VAN  growing  per  distance  away  from  the  anchor  image. 
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(c)  DR  uncertainty  represented  as  a  scatter  plot.  (d)  VAN  uncertainty  represented  as  a  scatter  plot. 
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2.6  Chapter  Summary 

This  chapter  presented  a  systems-level  framework  for  visual  navigation  termed  “visually 
augmented  navigation”.  VAN’s  systems-level  approach  leads  to  a  robust  solution  that 
exploits  the  complementary  characteristics  of  a  camera  and  strap-down  sensor  suite  to 
overcome  the  peculiarities  of  low-overlap  underwater  imagery.  Key  strengths  of  the  VAN 
framework  were  shown  to  be: 

•  Self-consistency.  Camera  measurements  forced  the  VAN  trajectory  of  Fig.  2-14  to 
“cross-over”  despite  the  presence  of  an  unmodeled  compass  bias. 

•  Robustness.  Trajectory  estimation  gracefully  handles  having  a  disconnected  temporal 
image  chain  since  navigation  builds  correlation  between  camera  poses. 

•  Smoothing.  The  delayed-state  EKF  framework  means  that  information  from  loop¬ 
closing  events  gets  distributed  throughout  the  entire  map  via  the  joint-correlations. 

•  Bounded  error  characteristics.  Uncertainty  in  a  DR  system  grows  monotonically  time, 
while  in  a  VAN  approach  it  is  a  function  of  network  topology.  Essentially,  VAN  allows 
error  to  be  a  function  of  space  and  not  time  —  space  being  distance  away  from  the 
reference  node  in  a  connected  topology. 

While  the  above  strengths  are  promising  attributes  of  a  standalone  precision  navigation 
system,  it  is  well  known  that  a  vanilla  EKF  SLAM  implementation  is  limited  to  relatively 
small  environments  due  to  the  0(n2)  computational  complexity  per  update  to  maintain  the 
covariance  matrix  (Fig.  2-19).  In  practical  terms,  this  implies  that  VAN  can  only  maintain  a 
hundred  or  so  images  in  its  view-based  map.  In  Chapters  3  and  4  we  address  this  scalability 
issue  by  exploring  the  re-parameterization  of  our  state  estimate  in  terms  of  the  dual  of  an 
extended  Kalman  filter  —  an  extended  information  filter.  In  particular,  Chapter  3  discusses 
feature-based  SLAM  and  the  recently  derived  sparse  extended  information  filter  (SEIF)  by 
Thrun  et  al.  [160].  SEIFs  relies  upon  making  approximations  in  the  information  form  to 
obtain  a  sparse  representation  allowing  for  efficient  updates.  However,  as  we  show,  this 
approximation  leads  to  unintended  consequences  regarding  filter  consistency.  Meanwhile, 
Chapter  4  presents  the  novel  insight  that  a  view-based  SLAM  framework  has  exact  sparsity 
in  the  information  form.  The  implication  of  this  is  that  we  can  retain  VAN’s  promising 
navigation  attributes  while  exploiting  the  sparse  representation  to  achieve  0(n)  algorithmic 
complexity. 
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Figure  2-19  An  EKF’s  update  time  grows  quadrat ically  with  state  size.  This  figure  depicts  CPU 
time  versus  map  size  for  the  JHU  dataset  of  Fig.  2-17. 
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CHAPTER  3 _ , 

i  : 1 

I - Sparse  Extended  Information  Filters:  Insights  into  Sparsification 


Recently,  there  have  been  a  number  of  variant  simultaneous  localization  and  mapping 
algorithms  that  have  made  substantial  progress  towards  large-area  scalability  by  pa- 
J  rameterizing  the  SLAM  posterior  within  the  information  (canonical/inverse  covari¬ 
ance)  form.  Of  these,  probably  the  most  well-known  and  popular  approach  is  the  sparse 
extended  information  filter  (SEIF)  by  Thrun  et  al.  [160].  While  SEIFs  have  been  success¬ 
fully  implemented  with  a  variety  of  challenging  real-world  data  sets  and  have  lead  to  new 
insights  into  scalable  SLAM,  open  research  questions  remain  regarding  the  approximate 
sparsification  procedure  and  its  effect  on  map  error  and  consistency. 

In  this  chapter,  we  examine  the  constant-time  SEIF  sparsification  procedure  as  it  per¬ 
tains  to  feature-based  SLAM  and  offer  new  insight  into  issues  of  consistency.  In  particular, 
we  show  that  exaggerated  map  inconsistency  occurs  within  the  global  reference  frame  where 
estimation  is  performed,  but  that  empirical  testing  shows  that  relative  map  relationships 
are  preserved.  We  then  present  a  slightly  modified  version  of  their  sparsification  procedure, 
which  is  shown  to  preserve  sparsity  while  also  generating  both  local  and  global  map  esti¬ 
mates  comparable  to  those  obtained  by  the  non-sparsified  SLAM  filter  (i.e.,  full-covariance 
EKF).  This  modified  approximation,  however,  is  no  longer  constant-time.  We  demonstrate 
our  findings  by  benchmark  comparison  of  the  modified  and  original  SEIF  sparsification 
rules  using  a  linear  Gaussian  SLAM  simulation  and  a  real-world  experiment  for  a  nonlinear 
dataset.  From  this  chapter  we  conclude  that  approximate  sparsification  is  a  necessary  re¬ 
quirement  for  feature-based  SLAM  to  be  efficient  in  the  information  form  —  however,  this 
approximation  is  non-trivial. 

3.1  Introduction 

Since  its  inception  with  the  fundamental  work  of  Smith,  Self,  and  Cheeseman  [154]  and 
Moutarlier  and  Chatila  [110],  roboticists  have  been  trying  to  address  scalability  issues  as¬ 
sociated  with  an  extended  Kalman  filter  based  approach  to  SLAM.  While  this  approach 
is  often  considered  the  “standard”  [30]  and  is  attractive  in  its  simplicity  (because  it  only 
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requires  tracking  first  and  second  moments  of  the  joint  landmark- robot  distribution),  a  well 
known  fact  is  that  EKF  SLAM  inference  requires  quadratic  complexity  in  the  number  of 
landmarks  per  update  to  maintain  the  joint-posterior  correlations.  As  a  consequence,  the 
direct  application  of  EKF  SLAM  is  limited  to  relatively  small  environments  (e.g.,  less  than 
than  100  landmarks). 

3.1.1  A  Survey  of  Scalable  Feature-Based  SLAM  Algorithms 
Submaps 

Over  the  years  a  number  of  different  approaches  have  been  put  forth  to  try  and  curb  this 
quadratic  cost  with  map  size.  One  of  the  more  conceptually  straightforward  approaches 
has  been  the  idea  of  dividing  the  world  into  local  submaps  [11,85,86].  Submaps  are  based 
upon  decomposing  the  environment  into  manageable  “local  pieces”  thereby  bounding  each 
map’s  computational  growth  by  limiting  its  size  (both  in  a  physical  sense  and  in  terms  of 
the  number  of  local  features  it  contains).  While  these  approaches  have  been  demonstrated 
to  be  computationally  efficient  for  mapping  large  cyclic  environments  (for  example  the  Atlas 
framework  by  Bosse  et  al.  [11],  and  the  CTS  algorithm  by  Leonard  and  Newman  [86]),  they 
sacrifice  convergence  rates  by  not  explicitly  enforcing  measurement  constraints  in  full  across 
the  network  of  submaps  [86].  In  addition,  the  definition  of  a  submap  tends  to  be  rather 
ad-hoc,  which  leads  to  a  poor  representation  for  the  border  features  [160]. 

Postponement 

Another  closely  related  idea  to  submaps  is  the  concept  of  postponement  [27]  and  its  vari¬ 
ants  [173].  Postponement  is  built  around  the  standard  EKF  SLAM  approach,  but  employs 
a  clever  “bookkeeping”  technique  to  the  update  equations  whereby  the  robot  is  able  to 
efficiently  work  in  a  “local”  map  context  without  the  burden  of  maintaining  the  “global” 
map.  It  is  proven  to  yield  mathematically  equivalent  results  to  the  full  EKF  implementation 
while  simultaneously  relieving  the  computational  load  when  working  locally;  as  long  as  the 
robot  observes  elements  within  the  local  environment,  measurements  can  be  efficiently  in¬ 
corporated  with  bounded  complexity.  Upon  moving  to  a  new  area,  however,  the  robot  must 
first  fuse  the  local-map  into  the  global  representation  at  the  standard  0(n2)  cost  (where  n 
is  the  total  number  of  global  landmarks)  —  hence  the  name  “postponement” . 

FastSLAM 

In  addition  to  the  EKF  approaches,  FastSLAM  [108]  has  appeared  in  the  recent  literature 
as  a  large-scale  SLAM  algorithm  based  upon  an  entirely  different  representation  of  the  joint 
landmark-robot  posterior.  FastSLAM  employs  Rao-Blackwellized  particle  filtering  [32]  to 
decouple  the  joint  landmark-robot  distribution  into  n  independent  landmark  estimation 
problems  by  using  a  particle  filter  to  sample  over  the  robot  trajectory.  This  landmark  fac¬ 
torization  is  exact  based  upon  each  particle  representing  a  realization  of  the  robot’s  path 
and  leads  to  an  O(ralogn)  update  complexity  where  m  is  the  number  of  particles  used  to 
represent  the  trajectory.  The  algorithm  has  the  benefit  of  intrinsically  handling  multiple 
data  association  hypotheses  (i.e.,  the  problem  of  establishing  correspondences  to  measure¬ 
ments  [118])  and  its  scalability  has  been  demonstrated  in  simulation  by  building  maps  with 
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over  50,000  features  [108].  However,  as  a  significant  drawback  to  its  general  applicabil¬ 
ity,  the  theoretical  relationship  between  the  size  of  the  mapped  area  and  the  number  of 
required  particles  is  poorly  understood  [160]  (and  conjectured  to  even  be  exponential  in 
environmental  size  [161]). 

Covariance  Intersection 

Lastly,  the  technique  of  covariance  intersection  (Cl)  [75]  represents  yet  another  approach  to 
large  scale  map  making  distinctly  different  from  the  EKF  framework  and  additionally  has  the 
benefit  of  constant-time  fusion  with  linear  storage.  It  achieves  low  computational  complexity 
by  essentially  ignoring  the  joint  landmark  covariances  and  adopting  a  conservative  fusion 
strategy  that  respects  all  possible  correlations  [23].  The  empirical  result  has  been  that 
while  Cl  can  handle  a  million  landmark  map  [75],  its  fusion  strategy  is  so  conservative  that 
practical  convergence  rates  cannot  be  achieved. 

3.1.2  The  Information  Form  and  a  New  Class  of  Algorithms 

Recently,  a  new  class  of  scalable  SLAM  algorithms  have  been  proposed  by  Thrun  et  al.  [160], 
Paskin  [120],  and  Frese  [49,51],  and  are  all  based  upon  the  canonical- form.  This  represen¬ 
tation  has  the  nice  interpretation  as  a  Gaussian  graphical  model  [120,  135].  As  Thrun 
et  al.  [160]  empirically  first  showed,  and  Frese  later  analytically  proved  [50],  the  inverse 
covariance  matrix  (i.e.,  information  matrix)  of  feature-based  SLAM  exhibits  a  “natural” 
sparseness  whereby  many  of  the  off-diagonal  elements  (i.e.,  graphical  constraints)  are  rela¬ 
tively  “weak”  (see  Fig.  3-1).  This  insight  has  spawned  the  development  of  scalable  SLAM 
algorithms  founded  upon  pruning  these  weak  constraints  and  exploiting  the  resulting  sparse 
representation  [49,120,160]. 


Figure  3-1  Feature-based  SLAM  exhibits  a  “natural”  sparsity  in  the  information  form.  A  compar¬ 
ison  of  the  structure  of  the  covariance  and  information  matrices  as  is  typically  seen  in  feature-based 
SLAM  implementations;  darker  shades  represent  larger  magnitudes,  (a)  The  correlation  matrix  is 
dense  and  requires  quadratic  storage,  (b)  Normalizing  the  information  matrix  in  the  same  manner 
as  the  correlation  matrix  yields  an  “almost  sparse”  representation  where  a  majority  of  the  elements 
are  orders  of  magnitude  smaller  than  the  dominant  entries. 


(a)  Normalized  covariance  matrix.  (b)  Normalized  information  matrix. 
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For  example,  both  Paskin  (Thin- Junction- Tree  Filters)  [120]  and  Frese  (TYeemap  Filters) 
[49]  employ  tree-based  approximations  to  sparsify  the  canonical-form  and  have  developed 
very  efficient  inference  algorithms  for  this  representation.  One  drawback  to  their  techniques, 
though,  is  that  their  tree- based  representations  cannot  explicitly  model  cyclic  environments 
nor  has  data  association  been  addressed.  Alternatively,  the  sparse  extended  information 
filter  proposed  by  Thrun  et  al.  [160],  probably  the  most  well  known  SLAM  information 
formulation,  is  based  upon  representing  the  SLAM  posterior  through  the  dual  of  the  extended 
Kalman  filter.  SEIFs  maintain  a  sparse  information  matrix,  an  approach  which  has  been 
demonstrated  to  be  both  efficient  and  scalable,  allows  for  explicit  representation  of  cyclic 
environments,  and  addresses  data  association  [92].  The  delicate  issue,  however,  is  how  to 
perform  the  necessary  sparsification  step  required  to  keep  the  information  matrix  sparse. 

In  this  chapter,  we  explore  in  depth  the  approximation  employed  by  SEIFs  to  enforce 
sparseness.  We  show  that  a  particular  assumption  in  its  derivation  leads  to  inconsistency 
of  the  global  map  error  covariance  estimates,  however,  empirical  testing  indicates  that  local 
map  relations  and  relative  uncertainties  are  preserved.  In  addition,  we  present  a  slightly 
modified  derivation  that  yields  an  alternative  sparsification  rule,  which  is  shown  to  produce 
both  global  and  local  map  estimates  comparable  to  the  full-covariance  EKF  while  also 
maintaining  the  same  level  of  sparsity  as  SEIFs.  Its  drawback,  though,  is  that  sparsification 
is  no  longer  constant-time.  We  demonstrate  our  insights  by  concluding  with  a  benchmark 
comparison  for  a  linear  Gaussian  SLAM  simulation  and  in  addition  present  results  for  a 
nonlinear  experimental  dataset. 


3.2  Background 

3.2.1  The  Gaussian  Information  Form 

The  information  form  is  often  called  the  canonical  or  natural  representation  of  the  Gaussian 
distribution.  This  notion  of  a  “natural”  representation  stems  from  expanding  the  quadratic 
in  the  exponential  of  the  Gaussian  distribution  as 

p{€t)  =  Mti  ^t) 

=  -^=exp{-I(^Sr^(  -2/xtTEt-1^  +  MtTSt-Vt)} 
e-$M,T£rVt 

v/l^r1! 

=  AT-1  (^t;  T7e,  At) 

where  At  =  E^-1  and  r\t  =  Atfit.  The  result  is  that  rather  than  parameterizing  the  normal 
distribution  in  terms  of  its  mean  and  covariance,  Af(£tm,  fit,  E*),  it  is  instead  parametrized 
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in  terms  of  its  information  vector  and  information  matrix,  Af  l(€t',T]t,  A*)  [7]. 


3.2.2  Marginalization  and  Conditioning 

The  covariance  and  information  representations  lead  to  very  different  computational  char¬ 
acteristics  with  respect  to  the  fundamental  probabilistic  operations  of  marginalization  and 
conditioning.  This  is  important  because  these  two  operations  appear  at  the  core  of  any 
SLAM  algorithm,  for  example  motion  prediction  and  measurement  updates.  Table  3.1  sum¬ 
marizes  these  operations  for  a  Gaussian  distribution  where  we  see  that  the  covariance  and 
information  representations  exhibit  a  dual  relationship  with  respect  to  marginalization  and 
conditioning.  For  example,  marginalization  is  easy  in  the  covariance  form  since  it  corre¬ 
sponds  to  extracting  the  appropriate  sub-block  from  the  covariance  matrix  while  in  the 
information  form  it  is  hard  because  it  involves  calculating  the  Schur  complement  over  the 
variables  we  wish  to  keep.  Notice  that  the  opposite  relation  holds  true  for  conditioning, 
which  is  easy  in  the  information  form  and  hard  in  the  covariance  form. 


Table  3.1  Summary  of  the  marginalization  and  conditioning  operations  for  the  Gaussian  distribution 
as  expressed  in  both  the  covariance  and  information  form.a 


p(a,/3)  -  A/’j 


P-g 


=  Af~1 


Aaa  Aa0 

A0a  Affp 

Marginalization 

Conditioning 

p(a)  =  fp(a,l3)dp 

p(a\0)  =  p(a,0)/p(P) 

Cov. 

T: 

II 

0 

M  =  Ma  +  p'EppiP  ~  ^p) 

Form 

£  —  £tta 

=  Eaa  — 

Info. 

Vi  =  TJa  -  Aa0P 

Form 

A  =  Aqq  —  AapApp  A^q 

A'  =  Aaa 

aA  derivation  of  these  operations  can  be  found  in  Appendix  §B.l. 


3.2.3  Controlling  Feature-Based  SLAM  Sparsity 

Most  SLAM  approaches  are  feature-based,  which  assumes  that  the  robot  can  extract  an 
abstract  representation  of  features  in  the  environment  from  its  sensor  data  and  then  use 
re-observation  of  these  features  for  localization  [154].  In  this  approach,  a  landmark  map 
is  explicitly  built  and  maintained.  The  process  of  concurrently  performing  localization 
and  map  building  are  inherently  coupled,  therefore,  the  robot  must  then  represent  a  joint- 
distribution  over  landmarks  and  current  pose,  i.e., 

P(*t  I  z‘,u‘)  =  (3.1) 
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where  £t  =  [x^,MT]T  represents  the  current  robot  state  and  landmark  map  respectively, 
zl  the  set  of  measurements  up  to  time  £,  and  ul  the  set  of  control  inputs.  In  (3.1)  we 
have  explicitly  modeled  the  distribution  as  being  jointly-Gaussian  based  upon  additive 
white  noise  models  and  first-order  linearizations  of  our  process  and  observation  models  as 
described  in  [154,160].  The  key  behind  scalable  SLAM  algorithms  in  the  canonical- form  is 
founded  upon  the  insight  that  the  information  matrix  A*  naturally  tends  to  exhibit  strong 
and  weak  constraints  as  shown  by  Fig.  3-1. 

Filtering  Causes  Fill-in 

What  Thrun  et  al.  [160]  insightfully  observed  is  that  the  time-projection  step  (i.e.,  motion 
prediction)  is  the  cause  for  creating  these  weak  constraints.  Furthermore,  by  bounding  the 
number  of  nonzero  off-diagonal  elements  linking  the  robot  to  landmarks,  many  of  these 
weak  constraints  can  be  eliminated.  Their  concept  is  to  partition  the  landmark  map  M 
into  a  set  of  active  features  m+  (i.e.,  those  with  a  nonzero  off-diagonal  element  linking  them 
to  the  robot  x*)  and  a  set  of  passive  features  m”  (i.e.,  those  with  no  link  to  x*).1  They 
showed  that  by  enforcing  an  upper  bound  on  the  number  of  active  features  m-1",  the  fill-in 
of  the  information  matrix  can  be  controlled. 

To  see  this,  consider  the  diagram  shown  in  Fig.  3-2(a).  We  begin  with  the  schematic 
shown  to  the  upper  left,  which  represents  the  robot,  x*,  at  time  t  connected  to  four  active 
landmark  map  elements,  m+  =  {mi, m2, m3, ms}.  Also  shown  is  a  single  passive  map 
element,  m“  =  {1114},  who  is  not  linked  to  the  robot,  but  only  to  another  landmark,  1115. 
For  now  we  will  ignore  how  this  connection  came  to  be  and  just  take  it  for  granted  that 
it  exists.  Conceptually,  what  Fig.  3-2(a)  represents  is  a  graphical  representation  of  the 
conditional  independencies  in  the  distribution  p(x*,  M|z*,  u*)  and  is  known  in  the  literature 
as  a  Markov  random  field  (MRF)  or  Markov  network  [122].  Therefore,  in  these  terms  we  see 
that  the  robot  to  active  landmark  connections  are  the  direct  result  of  perceptual  sensing 
and  that  the  lack  of  inter-landmark  constraints  should  be  correctly  interpreted  to  mean 
that  each  of  these  landmarks  is  conditionally  independent  given  the  robot  pose  as  described 
in  [108,  111].  The  intuition  behind  this  comes  from  viewing  the  noise  of  each  sensor  reading 
as  being  independent,  and  therefore,  determining  each  of  these  landmark  positions  is  an 
independent  estimation  problem  given  the  known  location  of  the  sensor. 

Shown  directly  below  each  Markov  network  is  an  illustration  of  the  associated  informa¬ 
tion  matrix.  Here  we  see  that  the  nonzero  off-diagonal  elements  encode  the  robot/landmark 
constraints  (i.e,  edges)  while  the  zeros  in  the  information  matrix  encode  missing  edges 
[121].  Moving  on  to  the  middle  diagram  of  Fig.  3-2(a),  we  see  that  it  represents  the  in¬ 
termediate  distribution  p(x*+i,  x*,  M|z*,  u*),  which  corresponds  to  a  time-propagation  of 
p(xt,  M|z* ,  u*)  where  we  have  augmented  our  state  vector  to  include  the  term  x*+i  rep¬ 
resenting  the  new  robot  pose  at  time  t  - hi.  Because  the  robot  state  evolves  according 
to  a  Markov  process,  the  new  robot  state  x^+i  is  only  linked  to  the  previous  robot  state 
xt.  Usually,  in  a  feature-based  SLAM  approach  only  the  current  robot  pose  is  estimated 
and  not  the  complete  trajectory.  Therefore,  we  always  marginalize  out  the  previous  robot 
pose  Xt  during  our  time-projection  step  to  yield  the  distribution  over  current  pose  and  map 
p(xt+ i,M|zi,ut)  =  / p(x*+i, X*, M|ze, u*)dxt  .  Recalling  the  formula  for  marginalization 

lrThe  total  landmark  map  is  given  by  M  = 
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Figure  3-2  A  graphical  explanation  of  SEIFs  methodology  for  controlling  sparsity  in  the  informa¬ 
tion  matrix,  (a)  A  sequence  of  illustrations  depicting  the  evolution  of  the  Markov  network  and 
corresponding  information  matrix  resulting  from  time  projection  when  viewed  as  a  two-step  process 
of  state  augmentation  followed  by  marginalization.  Gray  shades  imply  magnitude  with  white  being 
exactly  zero.  From  left  to  right  we  have:  (1)  the  robot  xt  connected  to  four  active  features,  m1;3  and 
m5;  (2)  state  augmentation  of  the  time-propagated  robot  pose  xt+i;  (3)  marginalized  distribution 
where  the  old  pose,  xt,  has  been  eliminated,  (b)  A  sequence  of  illustrations  highlighting  the  concept 
behind  sparsification.  If  feature  mi  can  first  be  made  passive  by  eliminating  its  link  to  the  old  pose, 
xt,  then  marginalization  over  xt  will  not  link  it  to  any  of  the  other  active  features.  This  implies  that 
we  can  control  fill-in  of  the  information  matrix  by  bounding  the  number  of  currently  active  features. 
Note  that  for  both  cases  the  passive  feature  m5  remains  disconnected. 


xt+i  xt  mi  m2  m3  014  xt+i  x t  mi  m2  m3  1TO4  ms  xt-j.  1  m  1  m2  m3  ITI4  ms 


(a)  Filtering  causes  fill-in  of  the  active  features. 


(b)  Bounding  the  number  of  active  features  controls  fill-in. 
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applied  to  a  Gaussian  in  the  information  form  (see  Table  3.1),  we  note  that  it  is  the  the 
outer  product  of  AQ^A^A^  that  causes  the  information  matrix  to  fill-in  and  become  dense 
as  shown  in  the  illustration  to  the  far  right  of  Fig.  3-2(a).2 


Controlling  Active  Feature  Fill-in 

Intuitively,  the  landmarks  mi,  m2,  m3,  and  1115  used  to  be  indirectly  connected  via  a  direct 
relationship  with  x*,  but  now  must  represent  their  indirect  relationship  directly  by  creating 
new  links  between  each  other.  Therefore,  the  penalty  for  a  filtered  feature-based  SLAM 
representation  is  that  the  Markov  network  becomes  fully  connected  and  the  associated 
information  matrix  becomes  fully  dense  (though  as  previously  mentioned,  the  authors  of 
[160]  make  the  empirical  observation  that  many  of  the  off-diagonal  elements  are  relatively 
small).  Insightfully,  this  understanding  of  fill-in  suggests  that  we  can  control  the  active 
feature  density  in  the  information  matrix  by  bounding  the  number  of  links  connected  to  xt 
before  marginalization  occurs  as  illustrated  by  Fig.  3-2(b).  This  key  insight  motivates  the 
concept  behind  SEIFs  sparsification  methodology,  which  is  the  process  of  how  we  remove 
links  to  satisfy  our  active  feature  bound  and  maintain  a  sparse  graph  representation.  But 
before  moving  on  to  discuss  how  SEIFs  actually  enforce  the  upper  bound  on  the  number 
of  active  features,  it  is  useful  to  first  elucidate  the  conditional  independence  relationship 
implied  by  active  and  passive  features. 


3.2.4  Robot  Conditional  Independence  from  Passive  Features 

A  very  useful  property  of  the  canonical- form  is  that  the  information  matrix  has  the  direct 
interpretation  as  a  Gaussian  Markov  random  field  (GMRF)  [168]  where  random  variables  are 
nodes,  non- zero  off-diagonal  elements  are  edges/constraints,  and  zero- valued  off-diagonal  el¬ 
ements  are  missing  edges  implying  available  conditional  independencies  [135].  Applying  this 
property  to  SEIF’s  active/ passive  partitioning  of  landmarks  we  see  that  Fig.  3-3  correctly 
illustrates  the  corresponding  block  information  matrix  and  associated  Markov  network  for 
the  SEIF  posterior.  In  particular,  Fig.  3-3  clearly  shows  that  there  is  a  missing  edge  be¬ 
tween  the  robot,  x*,  and  the  passive  features,  m“,  implying  that  the  two  are  conditionally 
independent  given  the  active  features,  m+ . 

Mathematically,  we  can  also  easily  prove  this  relationship  by  recalling  that  independence 
implies  that  the  Gaussian  conditional  posterior  p(x*,m~  |  m+,z*,u*)  must  have  a  block- 
diagonal  covariance  matrix  (i.e.,  for  a  Gaussian  random  variable,  uncorrelated  is  equivalent 
to  independent).  In  the  information  form,  conditioning  on  the  active  features,  m+,  corre¬ 
sponds  to  simply  extracting  the  {x*,m-}  sub-block  from  the  information  matrix,  At  (see 
Table  3.1).  Referring  again  to  Fig.  3-3,  we  note  that  this  sub-block  results  in  a  block- 
diagonal  conditional  information  matrix  over  xt  and  m”  whose  inverse  is  a  block-diagonal 
covariance  matrix.  Hence,  conditional  independence  is  proved.  As  we  show  next,  we  can 
exploit  this  conditional  independence  relationship  to  derive  a  sparsification  rule  that  allows 
us  to  bound  the  number  of  active  features. 


2Here  a  =  {xt+i,M}  and  0  =  x* 
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Figure  3-3  An  illustration  of  SEIF’s  concept  of  active  and  passive  features  and  their  relation  to  the 
robot,  (left)  A  schematic  of  the  block  3x3  SEIF  information  matrix.  Dark  squares  correspond  to 
nonzero  block-elements  while  white  squares  corresponds  to  exactly  zero  block  elements,  (right)  The 
SEIF  information  matrix  expressed  as  a  Markov  network.  The  missing  edge  between  x*  and  m“ 
implies  available  conditional  independence  given  m*. 


3.3  Sparsification 

In  feature-based  SLAM,  landmarks  become  active  through  observation  by  causing  them 
to  become  linked  to  the  robot  through  a  shared  off-diagonal  constraint.  Furthermore,  this 
constraint  will  decay  over  time  if  the  landmark  is  not  re-observed  [50],  but  will  never  become 
exactly  zero  (i.e.,  passive)  unless  it  is  “sparsified” .  Sparsification  refers  to  the  pruning 
operation  whereby  the  weak  robot- landmark  constraints  are  removed  causing  features  to 
be  made  passive.  It  is  a  useful  approximation  that  allows  sparsity  to  be  enforced  in  the 
information  matrix  by  bounding  the  number  of  active  features  as  described  in  §3.2.3. 

3.3.1  SEIF  Sparsification  Rule 

Sparsification  is  required  whenever  the  active  feature  threshold  is  exceeded  through  land¬ 
mark  observation.  SEIF’s  strategy  for  sparsification  is  based  upon  partitioning  the  landmark 
map,  M,  into  a  union  of  three  disjoint  sets,  M  =  {m°  U  m+  U  m-},  where  in  a  slight  abuse 
of  our  previous  notation,  m“  are  the  currently  passive  features  which  will  remain  passive 
after  sparsifying,  m+  are  the  currently  active  features  which  will  remain  active  after  sparsi- 
fying,  and  m°  are  the  currently  active  features  which  will  become  passive  after  sparsifying. 

We  begin  our  derivation  of  the  SEIF  sparsification  approximation  by  factorizing  the 
SLAM  posterior  over  the  robot  and  map  as 

p(x*,  m°,  m  f ,  m  )  —  p(xt|m°,  m+,m')p(m°,  m4  ,  m-)  (3.2a) 

=  p(x*|m°,  m"f ,  m“  =a)p(m°,m+,m").  (3.2b) 

For  notational  convenience  we  have  omitted  explicitly  writing  out  the  conditioning  on  zl 
and  u*.  The  above  factorization  invokes  the  available  conditional  independence  between 
the  robot  and  passive  features  discussed  in  §3.2.4  to  arbitrarily  assign  a  value  to  the  passive 
features  in  the  conditional  of  (3.2b)  (i.e.,  m”  =  a)  without  influencing  the  conditional  robot 
posterior.  In  other  words,  p(xt|m°,  m+,  m~")  =  p(xt|m°,  m+).  Note  that  in  the  original 
derivation  proposed  by  [160],  a  is  simply  set  to  zero  while  here  we  leave  it  as  a  free  parameter 
for  the  purposes  of  exposition. 

The  SEIF  sparsification  approximation  is  derived  from  (3.2b)  by  imposing  that  m°  be 
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passive  via  dropping  it  from  the  robot  posterior  as 


PSElFs  (x«,  m°,m+,m') 

=  p(xt|m+,m“  =  a)p( 
_  Pb(*-u  m+  |m'=a 

pc(m+  |  m-  =  a) 


m°,m+,m  ) 

(3.3a) 

^  Pd{ mH,m+,m  ). 

(3.3b) 

Here,  (3.3b)  merely  expresses  the  conditional  of  (3.3a)  as  a  ratio  and  the  subscripts  p#, 
pc,  Pd  are  used  for  notational  convenience  to  reference  the  different  pdfs  involved  in  the 
calculation.  While  the  factorization  of  (3.2b)  is  theoretically  exact  due  to  the  conditional 
independence  between  x*  and  m“  given  the  active  features,  (3.3)  is  in  error  because  xt  is 
no  longer  conditionally  independent  of  m“  given  only  a  partial  set  of  the  active  features 
(i.e.,  the  set  of  all  active  features  is  m°  U  m+).  This  implies  that  the  particular  value  of  a 
chosen  modifies  the  posterior  approximation. 

Equations  (3.4)  and  (3.5)  summarize  the  SEIF  sparsified  posterior  as  expressed  in  both 
covariance  and  information  form  respectively  —  a  complete  derivation  of  the  resulting  pos¬ 
terior  can  be  found  in  Appendix  §B.2.  For  ease  of  comparison  we  use  the  same  notation 
as  [160]  where  S  denotes  a  projection  matrix  over  the  state  space  £t  (e.g.,  x*  =  ex¬ 

tracts  the  robot  pose).  Note  that  the  mean  update  in  (3.4)  clearly  shows  that  the  original 
mean  vector  fit  is  modified  during  the  sparsification  step  for  values  of  a  ^  which 

indicates  a’s  influence  on  the  term  p(x*|m+,  m“  =  a)  used  in  the  approximation  of  (3. 3). 3 * 


Covariance  Form 

Mf  —  Mf.  +  “  Sm+  S^  +  )  flQ 

=  (Sxt,m+^£  Sxt,m+  —  Sm+  Sm+  +  Smo,m+,m-ED  Smo  m+  m-) 


where 


-l 


Mb  =  Sj,m+  {fit  +  fia) 


—  Sjtm+^SXim+  —  Sj>m+E*Sm-  (S^-E^Syyj-)  S^-E,sx>m+ 


(3.4) 


Me  =  Sm+  (Mt  +  Ma) 

EC  =  S^+EfSm+  -  S^+EtSm-  (ST_EtSm-)-1S^_EtSm+ 

MD 

^ D  =  ^Vn0,m+,m~ 

with  av  =  EtSm-(S^_EtSm-)~'(a  - 


3The  expression  for  the  sparsified  information  vector  as  presented  in  [160]  corresponds  to  setting 

a  =  pt,  (i.e.,  the  mean  of  the  passive  features)  and  not  a  =  0  as  stated  in  their  paper. 
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Information  Form 


^xtim+V B  ^m+Vc  “h  Smo  yyi+  m- T)  jj 

At  Sx  m+A#S  m+  Sm+ AcSm+  +  Smo  m+  m- A/)S 


iP  rri  +  m 


where 


(3.5) 


Vb  =  Sjt>m+  (r/,  -  77J  -  Sjt  m+A<Smo(S^oAtSmo)  ^  fat  -  ?7a) 

A#  —  SXt>m+ A*SXt7n+  —  ^Xt>m+ A*Smo  (S^oAjSmo)  S^oAiSXt  Tn+ 

^lC  (^7t  ^7a)  At^ar^m0  (^x*,m0  At^xt,rn0)  (^7t  ^7  a ) 

Ac  =  S^+A*Sm+  —  S^+ AtSXt>mo(SjtjmoAiSXt  mo)  Sjf<moAtSm+ 

D  ^m°1m+)m~  ^7t  ^m°,m+,m~  AjSXt  (SXt  A^SXt  )  ^xt^t 

Ad  —  ^mo  m+^m- A^Smo  m4->m-  —  Smo  m-h  m-  A^SXt  (SXt  A^SXt)  SXtA*Smo<m+^ 

with  rja  =  £*S m-a. 


3.3.2  Modified  Sparsification  Rule 


In  the  previous  section  we  showed  that  the  SEIF  derivation  introduced  a  conditioning  on  a 
specific  realization  of  the  passive  features  (i.e.,  m~  =  a).  This  conditioning  influences  the 
outcome  of  the  sparsification  approximation  and  in  particular  can  modify  the  resulting  mean 
estimate  as  evident  by  the  functional  dependence  of  (3.4)  on  a.  In  the  following  we  show 
that  we  can  easily  modify  the  original  SEIF  approximation  to  derive  a  more  correct  version  of 
the  sparsification  rule  by  explicitly  using  the  available  conditional  independence  relationship 
between  the  robot  and  passive  features  to  drop  m“  from  the  posterior.  This  modified 
version  of  the  SEIF  sparsification  rule  preserves  the  state  mean  and,  as  demonstrated  in  §3.4, 
provides  a  high  fidelity  approximation  that  yields  results  comparable  to  the  full-covariance 
EKF. 

We  begin  by  factorizing  the  posterior  p(x*,  m°,  m+,  m-)  using  Bayes  rule  just  like  in 
equation  (3.2a)  of  the  SEIF  derivation,  but  this  time  we  explicitly  employ  the  available 
conditional  independence  between  the  robot  and  passive  features  given  the  active  features 
to  drop  m“  from  the  posterior  over  x*  as 


p(x*,  m°,  m+ ,  m  )  =  p(x*|m°,m+,m  )p(m°,m‘f,m  ) 


='  p(xt|m°,  m+)p(m°, 
_  p(X(, m°|m+)  /  o 

p(mO\m+\  P[  ’ 


mf ,  m  ) 


mu  I  m 


m  ,  m 


-)• 


(3.6a) 

(3.6b) 

(3.6c) 


The  above  posterior  factorization  is  exact,  and  for  convenience  equation  (3.6c)  merely  re¬ 
expresses  the  conditional  of  (3.6b)  as  a  ratio.  To  obtain  the  sparsified  posterior  approxima- 
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tion,  we  now  impose  conditional  independence  between  x*  and  m°  as 


P(  i 


pv(m+) 


'(m°|m+)  o  + 

-^—4 - ~p(m  ,  m  ,  m  ) 

m+) 

(3.7a) 

(m°,  m+,  m“) 

(3.7b) 

-pD( ). 

(3.7c) 

For  convenience,  (3.7c)  simplifies  the  sparsified  posterior  to  a  ratio  of  marginals  where 
the  subscripts  pu,  pv,  po  are  used  to  notationally  reference  the  different  pdfs  involved.  As 
(3.7a)-(3.7b)  show,  sparsification  is  equivalent  to  imposing  conditional  independence,  which 
in  turn  is  equivalent  to  dropping  dependence  on  the  set  of  features  we  wish  to  deactivate 
(i.e.,  m°).  The  result  is  a  modified  sparsification  rule  summarized  by  equations  (3.8)  and 
(3.9)  expressed  in  both  covariance  and  information  form  respectively  —  see  Appendix  §B.3 
for  a  full  derivation. 


Covariance  Form 

^ t  =  (S,tim+El/  —  Sm+  T,v  Sm+  +  Smo>m+  Smom+>m- ) 

where 


-l 


Mu  =  Sxt,m+M( 

and  nD,  Ep  are  the  same  as  in  (3.4). 


Mv  —  Sm+^t 
Zv  =  S^+StSm+ 


(3.8) 


Information  Form 

V t  ~  SXt,m+Vu  “b  V D 


where 


,m+  A-U ^xt ,m+ 
rW  ^xt,m+^i  —  ®xt,m+ 

=  ^xt,m+  ”  Sxt,m+^^rn°,m~  (Smo  m- A£Smo  m- )  AtSXt  m+ 


(3.9) 


Vv  —  Sm+*7f  Sm+  A*SXtirn°,m_  (SXt  mo  m-  A*S 
Av  =  Sm+ A$Sm+  —  Sm+  AiSXt >mo  m-  (SXt  mo  m- 


xt,m°,m 


-1CT 


*lt 


AtS 


xt,m°,m 


qT 

^xt,m°,m- 


A*Sm+ 


and  r]D,  Ad  are  the  same  as  in  (3.5). 

In  particular,  note  that  equation  (3.8)  shows  that  the  modified-rule  clearly  maintains  the 
mean  estimate.  Furthermore,  careful  inspection  of  the  projection  matrices  involved  in  (3.9) 
shows  that  it  simultaneously  deactivates  the  map  features  m°  (i.e.,  SXt>m+  only  populates 
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the  robot/active  feature  sub-block  of  the  resulting  information  matrix  A*).  However,  a 
significant  drawback  (despite  its  correctness)  is  that  sparsification  is  no  longer  a  constant¬ 
time  operation  as  evident  by  the  expressions  for  Ay  and  A y  which  require  large  matrix 
inversions  over  the  passive  features  m“. 


3.3.3  What  happens  if  we  just  leave  m  alone? 


In  §3.3.1  we  showed  that  conditioning  the  SEIF  posterior  (3.3)  on  m”  =  a  influences  the 
approximation’s  outcome  which  is  an  undesirable  attribute.  However,  its  advantage  is  the 
fact  that  computing  rj,  A  is  a  constant-time  operation  since  only  a  matrix  of  the  size  of 
the  deactivated  features  m°  needs  to  be  inverted.  Section  3.3.2  then  derived  a  modified 
sparsification  rule  that  exploits  the  conditional  independence  between  xt  and  m“  to  yield  a 
sparsification  rule  that  preserves  the  distribution’s  mean  while  simultaneously  deactivating 
m°.  Unfortunately,  this  rule  is  no  longer  constant-time  (in  fact  cubic)  because  it  requires 
inverting  a  matrix  of  the  size  of  the  passive  features  m“  (this  constitutes  a  majority  of 
the  map).  Therefore,  a  natural  question  to  ask  is  what  happens  if  we  leave  m“  alone  in 
the  conditioning?  In  other  words,  suppose  that  the  sparsification  step  merely  drops  the 
dependence  of  xt  on  m°  while  leaving  m“  in  the  conditioning  —  what  will  the  result  be? 

We  begin  by  factorizing  the  original  distribution  using  Bayes  rule  as 


p(xt,m°,m+,  m  )  =  p(xt|m°,m+,m  )p(m(\  m+ ,  m  ) 

p(x(,m°|m+,m_) 


p(m°|m+,  m“) 


— r — p(m°,m+,m  ). 


(3.10a) 

(3.10b) 


Next  we  drop  x*’s  dependence  on  m°  by  forcing  xt  and  m°  to  be  conditionally  independent 
given  and  rn~  as 


P  (xt,  m°,  m+ 


m  ) 


_  p(x(|m+,m  )p(m°|m+ 

p(m°|m+,  m~) 
=  p(xf|m+,m_)p(m°,m+ 


’m  )  /  o  + 
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,m~). 


(3.11a) 

(3.11b) 


After  working  through  the  math  the  resulting  distribution  in  canonical  form  is  given  by 


Vt.  Sm+  m-  TJq  *+-  Smo  jn+  ,m“  V D 
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Figure  3-4  Leaving  m~  alone  in  the  conditioning  reactivates  passive  features,  (a)  The  Markov 
network  associated  with  the  SEIF  posterior,  (b)  The  Markov  network  after  sparsification  by  (3.10). 
The  approximation  forces  xt  and  m°  to  be  conditionally  independent  given  m-  Um+,  but  doesn’t 
constrain  x*  and  m~  to  maintain  a  sparse  relationship. 


and  r)D,  Ap  are  the  same  as  in  (3.5). 

Close  inspection  of  the  sparsified  information  matrix,  A*,  shows  that  we  have  successfully 
deactivated  m°  by  setting  the  shared  information  between  x*  and  m°  to  zero  (i.e.,  none  of  the 
projection  matrices  in  (3.12)  extract  them  together).  However,  further  inspection  of  (3.12) 
shows  that,  in  practice,  this  approximation  is  not  very  useful  because  the  first  term  in  the 
expression  (i.e.,  SXt>m+  m- ApSjt  m+  m_)  introduces  new  links  into  the  information  matrix. 
These  links  occur  between  the  robot,  x*,  the  active  features,  m+,  and  the  passive  feature, 
m~,  and  will  destroy  our  sparsity  at  the  next  time-projection  step.  Intuitively,  the  sparsified 
posterior  (3.11a)  forces  x*  and  m°  to  be  conditionally  independent  given  m+  Urn-,  but 
doesn’t  enforce  any  type  of  sparsity  constraint  between  x*  and  m“.  Therefore,  the  Markov 
network  is  allowed  to  reorganize  itself  into  the  sparsified  graph  depicted  in  Fig.  3-4.  So 
while  we  managed  to  remove  a  small  number  of  links  from  the  robot,  x*,  to  the  deactivated 
features,  m°,  we  have  reactivated  many  more  links  between  it  and  the  passive  features,  m~. 
As  a  result,  these  reactivated  links  will  now  cause  the  information  matrix  to  densify  during 
the  next  robot  motion  time-prediction  (§3.2.3). 


3.4  Results 


This  section  investigates  the  implications  of  the  two  different  sparsification  rules  by  com¬ 
paring  the  results  of  the  sparsified  information  filters  to  that  of  the  standard  Kalman  for¬ 
mulation.4  In  the  first  scenario,  §3.4.1,  we  consider  a  linear  Gaussian  (LG)  SLAM  simulation 
where  the  Kalman  filter  (KF)  is  the  optimal  Bayes  estimator  and  provides  a  benchmark 
measure  against  which  to  compare  the  different  sparsification  routines.  Subsequently,  in 
the  second  scenario,  §3.4.2,  we  test  the  algorithms  on  a  nonlinear  real-world  dataset  to 
understand  their  performance  in  practice. 


4These  results  were  produced  jointly  with  Matthew  Walter  [39]. 
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3.4.1  Simulation:  LG  SLAM 
Experimental  Setup 

In  an  effort  to  compare  the  effects  of  the  two  sparsification  strategies  in  a  controlled  manner, 
we  applied  the  three  estimators  (i.e.,  SEIF,  modified-rule,  and  KF)  to  a  synthetic,  linear,  2D 
dataset.  For  this  simulation,  vehicle  motion  was  constrained  to  be  purely  translational  (fixed 
orientation)  and  was  generated  by  a  linear,  constant-velocity  process  model  corrupted  by 
additive  white  Gaussian  noise.  As  the  robot  moved  around  in  the  environment,  it  measured 
the  relative  position  of  local  point  features,  again  perturbed  by  white  noise,  and  had  an 
active  feature  bound  set  at  10%  of  the  total  number  of  landmarks  in  the  environment. 
Since  the  simulation  was  restricted  to  LG  SLAM,  we  can  compare  the  effects  of  the  two 
sparsification  routines  relative  to  the  optimal  (KF)  solution. 

Experimental  Results 

To  test  the  consistency  between  the  different  filter  covariances  and  the  true  state  estimation 
errors,  we  use  the  normalized  estimation  error  squared  (NEES)  [7,  §5.4.2]  as  computed  based 
upon  a  series  of  Monte  Carlo  simulations  and  two  different  error  metrics.  The  first  metric 
compares  the  direct  output  of  the  filters  to  ground-truth  and  provides  a  measure  of  global 
error.  The  second  metric  computes  the  state  estimate  relative  to  the  first  feature  that  was 
observed,  denoted  xm,  via  the  standard  compounding  operation:  xmj  =  ©xm  ©  x,  [154], 
which  provides  a  measure  of  local/relative  error.  Fig.  3-5  compares  the  two  error  metrics 
for  the  vehicle  position  NEES  score  for  the  KF  and  information  filters.  Similarly,  Fig.  3-6 
shows  the  normalized  errors  for  a  single  map  feature  and  is  representative  of  the  performance 
for  other  map  elements.  The  horizontal  threshold  in  both  plots  signifies  the  97.5%  upper 
bound  for  the  chi-square  test.  Comparing  the  estimate  of  vehicle  and  map  positions  in  the 
world  frame,  the  modified-rule  yields  errors  nearly  identical  to  those  of  the  KF,  not  only 
in  regards  to  magnitude,  but  also  in  behavior  over  time.  In  comparison,  the  SEIF  global 
errors  are  noticeably  larger,  though  in  contrast,  the  normalized  relative  errors  are  roughly 
equivalent  to  those  of  the  KF  and  modified-rule.  This  apparent  discrepancy  indicatives  that 
the  relative  map  estimates  for  all  three  filters  have  converged  while  the  global  SEIF  estimate 
is  inconsistent. 

We  gain  further  insight  into  the  consequences  of  sparsification  by  looking  at  the  co- 
variances  associated  with  each  filter.  Fig.  3-7  depicts  a  histogram  comparing  the  ratio  of 
determinants  for  the  global  and  relative  KF  feature  covariances  to  those  of  the  two  informa¬ 
tion  filters.  To  aid  in  interpreting  the  ratio  as  a  metric,  values  of  one  represent  ideal,  while 
those  larger  than  one  indicate  the  amount  of  overconfidence.  For  both  information  filters, 
the  uncertainty  measures  are  overconfident  with  respect  to  those  of  the  optimal  KF,  and  in 
turn  are  inconsistent  with  the  true  estimation  error.  However,  the  difference  in  magnitude 
between  the  modified-rule’s  confidence  regions  to  those  of  the  standard  KF  are  nearly  neg¬ 
ligible  in  both  a  global  and  local  sense.  In  contrast  the  SEIF  rule  has  absolute  uncertainty 
that  is  significantly  more  overconfident.5  Meanwhile,  upon  referencing  the  state  estimates 
relative  to  the  first  observed  feature,  the  SEIF  covariance  matrix  now  reflects  nearly  the 

5The  exception  is  with  the  first  feature  added  to  the  map  which  is  the  source  of  the  outliers  shown  in  the 
plots  in  Fig.  3-7(a). 
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Figure  3-5  The  normalized  estimation  error  squared  (NEES)  for  the  vehicle  as  computed  based  upon 
20  linear  Gaussian  Monte  Carlo  simulations  [39].  The  horizontal  dotted-line  signifies  the  the  97.5% 
chi-square  upper  bound  in  both  plots.  The  error  shown  in  (a)  corresponds  to  a  direct  comparison 
of  the  filter  estimates  to  the  ground-truth,  and  represents  a  measure  of  global  consistency.  In  (b) 
we  plot  the  local  normalized  error  computed  relative  to  the  first  feature  instantiated  in  the  map 
(i.e.,  xmi  =  ©xm  0Xj).  SEIF’s  global  normalized  error  is  larger  as  a  result  of  an  absolute  state 
that  is  significantly  overconfident.  On  the  other  hand,  the  relative  map  error  is  nearly  identical  to 
that  of  the  modified-rule  and  KF,  which  empirically  indicates  that  the  SEIF  yields  locally  consistent 
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(a)  Global  NEES  for  the  vehicle. 
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Figure  3-6  The  NEES  for  a  representative  feature.  See  the  caption  of  Fig.  3-5  for  a  description. 


Time 


(a)  Global  NEES  for  a  feature. 
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same  level  of  uncertainty  as  the  KF  and  modified-rule.  In  the  process  of  root-shifting  the 
map  to  the  first  feature,  the  original  world  origin  now  becomes  included  as  a  state  element. 
Note  that  while  the  world  origin  uncertainty  estimate  for  the  modified-rule  agrees  with  those 
of  the  rest  of  the  relative  map  estimates,  the  same  is  not  true  for  the  SEIF’s  uncertainty 
measure  of  the  world-origin  as  indicated  by  the  outlier  in  Fig.  3-7(b).  This  suggests  that 
while  the  relative  SEIF  map  estimate  has  converged,  its  estimate  of  the  global  world  origin 
remains  inconsistent. 

The  effect  of  sparsification  on  the  covariance  estimates  is  consistent  with  what  is  observed 
with  the  normalized  errors.  In  other  words,  though  there  is  little  difference  between  the 
three  sets  of  feature  position  estimates,  the  errors  for  the  global  SEIF  map  are  much  larger 
due  to  overconfidence  in  its  state  estimates.  However  upon  root-shifting,  the  difference  in 
error  between  the  different  relative  maps  becomes  negligible. 


3.4.2  Experimental  Validation:  MIT  Tennis  Courts 

While  linear  simulations  are  helpful  for  investigating  our  findings,  more  often  than  not, 
real-world  SLAM  problems  involve  nonlinear  vehicle  motion  and  perception  models.  Fur¬ 
thermore,  real  data  often  includes  noise  that  is  not  truly  Gaussian.  For  these  reasons,  we 
tested  the  estimation  algorithms  on  a  typical,  nonlinear  dataset. 


Experimental  Setup 

As  depicted  in  Fig.  3-8,  a  wheeled  robot  was  manually  driven  around  an  environment  con¬ 
sisting  of  64  track  hurdles  positioned  on  four  adjacent  tennis  courts,  which  provide  ground- 
truth.  The  vehicle  observed  the  environment  using  a  SICK  laser  range  finder  and  was 
equipped  with  wheel  encoders  for  determining  the  motion  control  inputs.  Data  association 
(the  problem  of  correctly  pairing  measurement  data  with  the  corresponding  hurdle)  was 
addressed  offline  [167]  and,  thus,  is  identical  for  each  SLAM  filter.  In  our  feature-based 
representation,  each  hurdle  serves  as  an  individual  coordinate  frame  parameterized  by  a 
base  leg  position  and  its  orientation. 


Experimental  Results 

An  EKF  was  applied  alongside  the  two  different  information  filters  who  each  had  an  active 
feature  bound  of  10  landmarks.  The  resulting  SLAM  maps  are  shown  in  Fig.  3-9  and 
are  seen  to  exhibit  very  similar  behavior  to  the  LG  SLAM  results  (i.e.,  that  SEIF  has  a 
contrasting  global/relative  performance).  Fig.  3-9(a)  plots  the  SLAM  maps  in  terms  of  the 
global  state  representation  where  we  see  that  enforcing  sparsity  with  the  modified-rule  leads 
to  a  negligible  difference  from  the  EKF.  However,  like  in  the  LG  simulation,  the  SEIF  yields 
global  map  estimates  that  axe  inconsistent,  since  a  majority  of  the  true  hurdle  positions 
fall  well  outside  the  3-sigma  uncertainty  regions.  Meanwhile,  as  depicted  in  Fig.  3-9(b), 
root-shifting  the  map  relative  to  the  first  instantiated  feature  yields  a  relative  posterior 
very  similar  to  the  EKF. 
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Figure  3-7  LG  SLAM  simulation:  comparison  of  the  SEIF  and  modified-rule  filter  covariances  to 
the  optimal  KF.  For  each  map  element  we  compute  the  ratio  of  the  KF  feature  determinant  to  the 
information  filter  (IF)  feature  determinant  —  hence,  ratios  greater  than  1  indicate  by  how  much 
each  IF  is  overconfident.  To  facilitate  comparison  of  all  map  elements,  these  ratios  are  displayed 
as  histograms,  (a)  The  uncertainties  are  computed  directly  from  the  covariances  maintained  by 
the  filters.  Note  that  the  outlier  in  both  histograms  (entry  nearest  to  one)  corresponds  to  the  first 
mapped  feature,  (b)  The  uncertainties  are  computed  for  a  relative  map  w.r.t.  to  the  first  observed 
feature.  Though  both  sparsification  rules  are  overconfident,  note  that  the  modified-rule  is  nearly 
equivalent  to  the  KF  (ratios  «  1)  for  both  the  global  and  relative  maps.  Meanwhile,  the  global  SEIF 
estimates  are  significantly  overconfident  while  the  relative  estimates  are  approximately  equivalent 
to  the  KF.  In  addition,  the  outlier  in  the  SEIF  histogram  corresponds  to  the  original  world  origin  as 
represented  in  the  root-shifted  reference  frame  —  its  significant  overconfidence  is  a  direct  consequence 
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(a)  Ratio  of  global  map  determinants. 


Modified  Rule  SEIF 


(b)  Ratio  of  relative  map  determinants. 
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Figure  3-8  The  experimental  setup  used  to  collect  the  hurdles  dataset  on  the  MIT  tennis  court. 
Ground-truth  is  determined  from  the  court  baselines. 


3.5  Discussion 


While  the  modified-rule  was  shown  to  outperform  the  SEIF  by  producing  error  estimates 
nearly  identical  in  both  a  global  and  local  sense  to  those  of  the  standard  Kalman  formulation, 
close  inspection  reveals  that  its  estimates  are  still  slightly  overconfident  with  respect  to  the 
KF.  This  section  seeks  to  explain  the  cause  of  this  inconsistency. 


3.5.1  Imposing  Conditional  Independence  Leads  to  Inconsistency 


Instructively,  it  can  be  shown  that  the  overconfidence  is  a  direct  result  of  imposing  condi¬ 
tional  independence  between  the  robot  and  the  deactivated  features,  m°.  To  illustrate  this, 
consider  the  general  three  state  distribution  p(a,b,c)  =  p(a\b,  c)p(b,  c),  and  its  sparsified 
approximation  where  any  possible  dependence  between  a  and  b  is  ignored: 

p(a,  6,  c)  =  p(a\c)p(b,  c ).  (3.13) 


To  understand  the  effect  of  this  approximation  on  LG  SLAM,  suppose  that  the  true  distri¬ 
bution  in  covariance  form  is  given  by 
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Figure  3-9  Comparison  of  the  EKF  and  information  filter  SLAM  maps  with  ground-truth  (cross¬ 
hairs)  for  the  hurdles  experimental  dataset  (from  Eustice,  Waiter,  and  Leonard  [39]).  The  plots 
in  (a)  correspond  to  the  global  feature  poses  as  directly  estimated  by  the  three  SLAM  algorithms 
together  with  the  3-sigma  confidence  bounds.  Shown  in  (b)  are  the  relative  maps  and  corresponding 
3-sigma  uncertainty  ellipses  transformed  relative  to  the  first  hurdle  added  to  the  map.  As  indicated 
by  the  right- most  plot  of  (a),  the  SEIF  maintains  global  feature  estimates  that  are  significantly 
overconfident.  Meanwhile,  the  modified-rule  yields  estimates  for  global  feature  pose  and  uncertainty 
that  are  nearly  identical  to  those  of  the  EKF.  Considering  the  relative  maps  (b),  the  two  sparsified 
filters  perform  similarly  to  the  EKF. 
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(a)  Global  maps. 
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(b)  Relative  maps. 
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where  pab,  pac,  and  pbc  are  the  normalized  correlation  coefficients.  Applying  the  sparsifica- 
tion  approximation  of  (3.13)  yields 
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A  necessary  and  sufficient  condition  for  consistency  is  that  the  covariance  matrices  must 
obey  the  inequality,  E  —  E  >  0  [23].  To  test  whether  or  not  this  condition  is  true  for  (3.15), 
we  note  that  a  sufficient  condition  test  for  positive  semi-definiteness  is  that  the  determinant 
of  all  upper  left  sub-matrices  must  be  positive  [156].  Applying  this  test  to  (3.15)  we  get 
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(3.16) 


where  in  general  the  determinant  of  the  upper  left  2  x  2  of  (3.16)  is  less  than  zero  for 
PacPbc  7^  Pab •  Hence,  extending  this  insight  to  both  the  modified  and  original  SEIFs  rule, 
we  see  that  imposing  conditional  independence  between  the  robot  and  the  deactivated 
features  results  in  an  approximation  that  is  inconsistent.  Therefore,  while  the  modified-rule 
estimates  are  comparable  to  the  KF,  this  explains  the  cause  of  their  slight  overconfidence. 


3.6  Chapter  Summary 

In  conclusion,  recent  novel  insights  into  the  canonical  formulation  of  SLAM  have  revealed 
that  sparseness  is  a  “natural”  characteristic  of  the  information  parameterization.  This  has 
lead  to  promising  new  research  into  scalable  algorithms  based  upon  pruning  relatively  weak 
constraints  in  the  information  form  to  achieve  sparsity.  The  delicate  issue,  however,  is  “how 
to  approximate  the  posterior  with  an  exactly  sparse  representation  in  a  consistent  manner?” 

In  this  chapter  we  demonstrated  that  the  constant-time  SEIF  method  of  enforcing  spar¬ 
sity  leads  to  an  inconsistent  global  map,  however,  empirical  testing  indicates  that  the  relative 
map  relationships  are  preserved.  We  also  showed  that  by  exploiting  the  conditional  inde¬ 
pendence  between  the  robot  and  the  passive  features  given  the  active  map,  that  a  new 
alternative  version  of  the  SEIF  sparsification  rule  can  be  derived.  This  modified-rule  yields 
a  sparsified  posterior  comparable  to  that  of  the  EKF  for  both  global  and  relative  maps,  but 
unfortunately  this  accuracy  comes  at  an  impractical  cost  since  it  requires  matrix  inversion 
over  the  passive  features,  which  implies  cubic-time. 

In  addtion,  we  also  noted  that  LG  SLAM  simulation  results  indicated  that  sparsification 
leads  to  an  overconfident  state  estimate.  We  investigated  the  cause  of  this  inconsistency 
in  §3.5  and  concluded  that  this  overconfidence  is  a  direct  result  of  imposing  conditional 
independence  between  state  variables.  Hence,  this  suggests  that  both  a  consistent  and  com¬ 
putationally  efficient  approximation  for  imposing  sparsity  in  feature-based  SLAM  remains 
an  open  research  task.  Insightfully,  Chapter  4  will  show  that  there  exists  an  alternative 
case  where  exact  sparsity  can  be  achieved  without  approximation,  and  that  this  case  is 
view- based  SLAM. 
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CHAPTER  4 


Exactly  Sparse  Delayed-State  Filters 


rpc 

for 
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chapter  presents  the  novel  insight  that  the  SLAM  information  matrix  is  exactly  sparse 
a  view-based  framework.  A  view-based  representation  relies  upon  scan-matching  raw 
sensor  data  and  maintaining  an  associated  collection  of  delayed  robot  states  such  that 
registration  results  in  virtual  observations  of  robot  motion  with  respect  to  a  place  it  has 
previously  been  (Chapter  2).  Within  this  framework,  the  exact  sparsity  of  the  delayed-state 
information  matrix  is  in  contrast  to  the  approximate  sparsity  of  other  recent  feature-based 
SLAM  information  algorithms  (Chapter  3).  Furthermore,  this  exact  sparsity  is  a  direct 
result  of  the  Markov  property  of  the  process  model  coupled  with  the  fact  that  the  Gaussian 
canonical  form  encodes  constraints  between  random  variables  —  we  call  this  result  “exactly 
sparse  delayed-state  filters  (ESDFs)”.  The  benefit  is  that  a  delayed-state  framework  can 
take  advantage  of  the  sparse  information  matrix  parameterization  without  having  to  incur 
any  approximation  error.  Therefore,  we  can  use  ESDFs  for  our  VAN  methodology  to  produce 
equivalent  results  to  the  “full-covariance”  EKF  solution,  however,  at  only  O(n)  cost. 

Another  novel  contribution  that  we  present  is  a  clever  strategy  for  efficiently  accessing 
and  maintaining  consistent  marginal  covariances  within  a  SLAM  information  filter,  thereby 
greatly  increasing  the  reliability  of  data  association.  Since  naive  access  to  the  covariance 
matrix  requires  matrix  inversion  (a  cubic  operation),  we  have  developed  a  novel  technique 
based  upon  solving  a  sparse  system  of  linear  equations  coupled  with  the  application  of 
constant-time  Kalman  updates.  Essentially,  our  strategy  maintains  estimates  of  the  covari¬ 
ance  block-diagonal  and,  in  addition,  the  robot’s  cross-correlation  to  these  elements.  As 
we  show,  this  technique  produces  consistent  covariance  estimates  suitable  for  robot  plan¬ 
ning  and  data  association  —  something  that  has  been  an  open  research  issue  for  all  SLAM 
information  filters. 


4 . 1  Background 

To  our  knowledge,  the  earliest  related  work  that  exploited  the  efficiency  of  the  measurement 
update  in  the  inverse  covariance  form  was  published  by  McLauchlan  and  Murray  [103],  in  the 
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context  of  recursive  structure- from-motion  (SFM).  This  work  was  subsequently  extended  to 
realize  a  hybrid  batch/ recursive  visual  SLAM  implementation  that  unified  recursive  SLAM 
and  bundle  adjustment  [105].  McLauchlan  recognized  the  potential  increase  in  efficiency 
that  can  be  gained  via  approximations  to  maintain  sparsity  of  the  information  matrix: 


It  has  long  been  known  in  the  photogrammetry  community,  in  the  form  of  the 
equivalent  normal  formulation,  that  the  [information]  matrix  . . .  takes  a  special 
sparse  form  in  the  context  of  reconstruction - [However,  in  a  recursive  formu¬ 

lation]  . . .  eliminating  motion  fills  in  the  structure  blocks.  This  has  to  be  avoided 
to  maintain  update  times  proportional  to  n.  So  our  partial  elimination  adjust¬ 
ment  method  is  to  ignore  corrections  that  fill-in  zero  blocks,  while  applying  the 
correction  to  the  blocks  which  are  already  non-zero. 


While  the  consistency  implications  of  this  approximation  are  unknown,  in  practice  the 
method  achieved  results  approaching  those  of  a  full  batch  solution  for  moderate  duration 
image  sequences. 

Recently,  the  SLAM  community  has  also  turned  its  attention  to  exploring  the  informa¬ 
tion  parameterization  for  increased  efficiency.  In  particular,  Chapter  3  discussed  a  new  class 
of  scalable  feature-based  SLAM  algorithms  founded  upon  representing  the  posterior  in  the 
canonical  form.  Since  the  information  form  naturally  yields  an  “almost  sparse”  represen¬ 
tation  (whereby  the  robot-landmark  information  matrix  is  dominated  by  a  relatively  few 
number  of  entries),  these  new  algorithms  achieve  scalability  by  eliminating  the  weak  en¬ 
tries  and  exploiting  the  remaining  sparse  representation  (e.g.,  sparse  extended  information 
filters  (SEIFs)  [160],  Thin  Junction- Tree  Filters  [121],  and  Treemap  Filters  [49]).  However, 
enforcing  sparsity  is  necessarily  an  approximation ,  which  (as  we  saw  in  the  case  of  SEIFs) 
can  lead  to  map  estimate  inconsistency. 

Interestingly,  it  is  the  same  phenomenon  that  plagues  both  the  information  formulations 
of  McLauchlan  and  Murray  [103, 105],  and  the  feature-based  SLAM  algorithms  of  Thrun  et 
al.  [160],  Paskin  [121],  and  Frese  [49]  —  and  that  is  that  “eliminating  motion  fills  in  the 
structure  blocks.”  In  §3.2.3  we  discussed  this  concept  in  depth  and  illustrated  how  elim¬ 
inating  the  robot’s  trajectory  causes  the  SLAM  landmark  posterior  to  density.  This  fill-in 
destroys  sparsity  and,  hence,  any  resulting  efficiency  associated  with  a  sparse  representa¬ 
tion.  This  is  the  reason  why  all  feature-based  SLAM  information  algorithms  are  founded 
upon  some  type  of  pruning  strategy  that  removes  weak  constraints.  Fortunately,  because 
we  do  maintain  samples  from  the  robot’s  trajectory  in  our  VAN  framework,  and  because  the 
information  matrix  represents  constraints  among  random  variables,  our  view-based  repre¬ 
sentation  is  intrinsically  sparse.  Hence,  VAN  can  exploit  the  efficiency  of  SLAM  information 
filters  in  a  consistent  manner. 

In  the  following,  we  present  both  our  view-based  SLAM  information  framework  (§4.2) 
and  our  conservative  data  association  technique  (§4.3).  Benchmark  results  (§4.5)  quanti¬ 
fying  the  view-based  information  filter  efficiency  with  respect  to  the  standard  EKF  VAN 
formulation  of  Chapter  2  are  shown  for  the  JHU  dataset.  In  addition,  real-world  results  are 
presented  for  the  largest  visually- navigated  underwater  dataset  to  date  using  data  from  a 
recent  ROV  survey  of  the  wreck  of  the  RMS  Titanic. 
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4.2  Filter  Mechanics 

4.2.1  State  Augmentation 

We  begin  by  describing  the  method  of  state  augmentation,  which  is  how  we  “grow”  the 
state  vector  to  contain  a  new  delayed-state.  This  operation  occurs  whenever  we  have  a  new 
“view”  that  we  wish  to  store.  For  example,  in  our  VAN  framework  we  add  a  delayed-state 
for  each  acquired  image  of  the  environment  that  we  wish  to  be  able  to  revisit  at  a  later 
time. 


Adding  a  Delayed-State 

Assume  for  the  moment  that  our  estimate  at  time  t  is  described  by  the  distribution  given 
below  expressed  in  both  covariance  and  information  form. 


^(xt.MlzSu*)  =  Af( 


Mm 


Zxtxt 

^  A ixt 


^•XtM 
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This  distribution  represents  a  map  and  current  robot  state,  M  and  x*  respectively,  given 
all  measurements,  z*,  and  control  inputs,  u*.  Here  the  map  variable  M  is  used  in  a  general 
sense,  for  example  it  could  represent  a  collection  of  delayed-states  or  a  set  of  landmark 
features  in  the  environment.  For  now  we  don’t  care,  because  we  want  to  show  what  happens 
when  we  augment  our  representation  to  include  the  time-propagated  robot  state,  x*+j, 
obtaining  the  distribution  p(xt+i,xi,M|zt,ut4'1),  which  can  be  factored  as 


p(xt+i,xt,M|zt,ut+1)  =  p(xt+i|xt,M,zt,ut+1)p(xt,M|zt,ut+1) 

M=kov  p(xt+i|xt,ut+i)p(xt,M|z‘,ut).  (4.1) 


In  (4.1)  we  factored  the  posterior  into  the  product  of  a  probabilistic  state-transition  multi¬ 
plied  by  our  prior  using  the  common  assumption  that  the  robot  state  evolves  according  to 
a  first-order  Markov  process.  Equation  (4.2)  describes  the  general  nonlinear  discrete-time 
Markov  robot  motion  model  we  assume  and  (4.3)  its  first-order  linearized  form  where  F  is 
the  Jacobian  evaluated  at  nXt  and  w(  ~  Af(0,  Q)  is  the  white  process  noise.1 


xt+i  =  f(xt,u(+i)  +  wf 

~  f(AtXt,ut+1)  +  F(xf  -  nXt)  +  wt 


(4.2) 

(4.3) 


Note  that  in  general  our  robot  state  description,  Xt,  consists  of  both  pose  (i.e.,  position  and 
orientation)  and  kinematic  components  (e.g.,  body-frame  velocities,  angular  rates). 


Augmentation  in  the  Covariance  Form 

Under  the  linearized  approximation  of  (4.3),  the  augmented  state  distribution  of  (4.1)  is 
also  Gaussian,  and  in  covariance  form  its  result  is  given  by  [155]: 

p(xt+i,x(,  Mlz1,  u£+1)  =  Af{ii't+i,  S't+1) 

'See  Appendix  §A.3  for  a  description  of  our  discrete-time  vehicle  model. 
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The  lower-right  2x2  sub-block  of  corresponds  to  the  covariance  between  the  delayed 
robot  state  element,  x*,  and  the  map,  M,  and  has  remained  unchanged  from  the  prior. 
Meanwhile,  the  first  row  and  column  contain  the  cross-covariances  associated  with  the  time 
propagated  robot  state,  x*+ 1,  which  includes  the  effect  of  the  process  model. 


Augmentation  in  the  Information  Form 

Having  obtained  the  delayed-state  distribution  in  covariance  form,  we  can  now  transform 
(4.4)  to  its  information  form  (4.5).  This  requires  inversion  of  the  3x3  block  covariance 
matrix  whose  tedious  derivation  we  omit  here,  though,  note  that  (4.5)  can  be  verified 
by  the  fact  that  A'(+1£'(+1  =  I  and  r?'(+1  =  A{+1/x't+1. 

p(xt+i,  xt,  M|z£,  ut+1)  =  A/"-1  (Vt+i.  A't+i) 

@1 

A  xtM 

A  MM 

(4.5) 
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Markovity  Yields  Exact  Sparseness 

Equation  (4.5)  yields  a  key  insight  into  the  structure  of  the  information  matrix  regarding 
delayed-states.  We  see  that  augmenting  our  state  vector  to  include  the  time-propagated 
robot  state,  x*+i,  introduces  shared  information  only  between  it  and  the  previous  robot 
state,  xt.  Moreover,  the  shared  information  between  x*+i  and  the  map,  M,  is  always 
zero  irrespective  of  what  M  abstractly  represents  (i.e.,  regardless  of  whether  M  represents 
a  set  of  landmarks  or  a  collection  of  delayed-states,  the  result  will  always  be  zero).  This 
sparsity  in  the  augmented  state  information  matrix  is  a  consequence  of  the  Markov  property 
associated  with  the  state  transition  probability  p(x*+i |x*, u*+i).  In  terms  of  a  Markov 
network  [122],  x*+i  is  only  serially  connected  to  its  parent  node,  xt,  and  therefore,  is 
conditionally  independent  of  M. 

By  induction,  a  key  property  of  state  augmentation  in  the  information  form  is  that 
if  we  continue  to  augment  our  state  vector  with  additional  delayed-states,  the  information 
matrix  will  exhibit  a  block  tridiagonal  structure  linking  each  delayed-state  with  the  post  and 
previous  states  as  shown  in  (4.6).  Hence,  the  view-based  delayed-state  SLAM  information 
matrix  is  naturally  sparse  without  having  to  make  any  approximations. 

AXt+ixt+i  AXt+lXt 

At  a 

ivxt+l Xt  uvXtXt 


'XtXt-  1 


* Xt-\Xt-\ 


VXt-\Xt-2 


(4.6) 
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4.2.2  Measurement  Updates 

One  of  the  very  attractive  properties  of  the  information  form  is  that  measurement  updates 
are  constant-time  and  additive  in  an  extended  information  filter  (EIF)  [160]  —  this  is  in 
contrast  to  an  EKF’s  quadratic  complexity  per  update,  which  plagued  our  VAN  methodology 
of  Chapter  2.  Assume  the  following  general  nonlinear  measurement  function  (4.7)  and  its 
first-order  linearized  form  (4.8): 


Zt  =  h(£t)  +  v<  (4'7) 

«h(At)  +  H(&-£t)  +  vt  (4.8) 

where  is  the  predicted  state  vector  distributed  according  to  ~  Et)  =  Af~l  (r)t,  At) , 

v(  is  the  white  measurement  noise  vt  ~  A/"(0,  R),  and  H  is  the  Jacobian  evaluated  at  p,t. 
The  EKF  covariance  update  requires  computing  the  Kalman  gain  and  updating  fi,  and  E( 
via  [7]: 


K  =  E£HT(HEtHT  +  R)_1 

Ht  =  fit  +  K(zt  -  h(p.t))  (4.9) 

E£  =  (I  -  KH)Et(l  -  KH)T  +  KRKt. 

This  calculation  non-trivially  modifies  all  elements  in  the  covariance  matrix  resulting  in 
quadratic  computational  complexity  per  update  [154].  In  contrast,  the  corresponding  EIF 
update  is  given  by  [160]: 


At  =  At  +  HtR_1H 

Vt  =  Vt  +  HTR_1(zt  -  h(p.t)  +  H/if). 


ESDF  Updates  are  Constant-Time 

Equation  (4.10)  shows  that  the  information  matrix  is  additively  updated  by  the  outer 
product  term  HTR-1H.  In  general,  this  outer  product  modifies  all  elements  of  the  predicted 
information  matrix,  A*,  however,  a  key  observation  is  that  the  SLAM  Jacobian,  H,  is  always 
sparse  [160].  For  example,  in  the  VAN  framework  of  Chapter  2,  pairwise  registration  of 
images  I{  and  Ij  provides  a  relative-pose  measurement  between  states  x»  and  Xj  resulting 
in  a  sparse  Jacobian  of  the  form: 


H  = 


dh 

dxt 


ah 


d'x.j 


As  a  result,  only  the  four  block-elements  corresponding  to  x*  and  Xj  of  the  information 
matrix  need  to  be  modified.  In  particular,  the  information  in  the  diagonal  blocks  AXiXi 
and  AXjXj  is  increased,  while  new  information  appears  at  AXiXj  and  its  symmetric  counter¬ 
part  AXjXi.  This  new  off-diagonal  information  reflects  the  addition  of  a  new  edge  into  the 
corresponding  Markov  network  linking  the  nodes  x*  and  Xj. 

Putting  (4.6)  together  with  (4.10),  we  see  that  an  important  consequence  of  the  delayed- 
state  framework  is  that  the  total  number  of  nonzero  off-diagonal  elements  in  the  information 
matrix  is  linear  in  the  number  of  state  elements  and  relative-pose  constraints  (i.e.,  cam- 
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Figure  4-1  View-based  SLAM  is  exactly  sparse.  This  figure  highlights  the  exact  sparsity  of  the 
view-based  SLAM  information  matrix  using  data  from  a  recent  ROV  survey  of  the  wreck  of  the 
RMS  Titanic.  In  all  there  are  867  delayed-states  where  each  state  is  a  12- vector  consisting  of  6-pose 
and  6-kinematic  components.  The  resulting  information  matrix  is  a  10,404  x  10,404  matrix  with 
only  0.52%  nonzero  elements. 


(a)  Resulting  information  matrix  for  the  RMS  Titanic  survey. 

camera 


camera  camera 

(b)  Conceptual  illustration  of  the  measurement  constraints  (Markov  network). 


era  measurements)  for  a  bounded  graph  structure.  Hence,  without  any  approximation,  a 
view-based  representation  is  exactly  sparse  and,  furthermore,  requires  only  linear  storage 
(Fig.  4-1).  In  our  application,  we  control  the  degree  of  sparisty  by  bounding  the  number  of 
image  registrations  that  the  robot  may  attempt  per  state  augmentation.  In  other  words, 
the  robot  is  only  allowed  to  hypothesize  k  possible  candidate  images  (where  k  =  5  in  our 
application)  for  attempted  registration  with  the  current  view;  this  leads  to  at  most  2 nk 
non-Markov  off-diagonal  constraints  in  the  resulting  information  matrix. 

As  a  side  note,  it  is  worth  pointing  out  that  (4.7)  assumes  that  the  measurements  are 
corrupted  by  time  independent  noise.  Since  scan-matching  methods  rely  upon  registering 
raw  data,  this  criteria  may  be  violated  if  data  is  reused.  In  our  VAN  application,  relative- 
pose  measurements  are  generated  by  pairwise  registration  of  images  with  common  overlap. 
As  we  showed  in  §2.4.4,  this  produces  motion  estimates  that  are  weakly  (if  at  all)  correlated 
for  our  AUV  application.  However,  for  the  general  case,  measurement  independence  should 
be  ensured  by  only  using  a  set  of  raw  data  correspondences  once,  so  that  scan-matching 

110 


measurements  remain  independent. 


4.2.3  Motion  Prediction 


Motion  prediction  corresponds  to  a  time  propagation  of  the  robot’s  state  from  time  t  to  time 
t  +  1.  In  (4.5)  we  derived  an  expression  in  the  information  form  for  the  joint-distribution 
between  the  time  predicted  robot  pose,  xt+i,  and  its  previous  state,  xt  —  in  other  words 
p(x£+i,xt,  M|z£,  ut+1).  To  derive  the  time  propagated  distribution  p(xf+1,  M|z£,  ut+1), 
note  that  all  that  is  required  is  to  simply  marginalize  out  the  previous  state,  x£,  from  the 
joint-distribution  in  (4.5).  Referring  to  Table  3.1,  pg.  87  for  marginalization  of  a  Gaussian 
in  the  information  form  we  have2 


Th+ 1  = 


p(xt+i , M | zJ , ut+1 )  =M  ^t+i.At+i)  =  J p(xt+i,xt,M|z£,ut+1)dx£ 
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where 


ft  =  (A*tIt  +  FtQ“1F)  and  =  Q_1  -  Q^Fft-^Q-1 

=  Q-1  -  Q-lF(FTQ-‘F  +  A„«,r1FTQ-1 
=  (Q  +  FA-1ItFT)-1. 


ESDF  Prediction  is  Constant-Time 

An  important  consequence  of  the  delayed-state  framework  is  that  (4.11)  can  be  imple¬ 
mented  in  constant-time.  To  see  this  we  refer  to  Fig.  4-2,  which  illustrates  the  effect  of 
motion  prediction  for  a  collection  of  delayed-states.  We  begin  with  the  Markov  network 
of  Fig.  4-2(a)  showing  a  segregated  collection  of  delayed-states.  Our  view-based  “map” 

2The  simplification  of  'l1  employs  the  matrix  inversion  lemma: 

(A  +  BCBt)_1  =  A-1  -  A-1B(BtA_1B  +  C-1)-1BtA_1. 


Ill 


Figure  4-2  ESDF  motion  prediction  is  constant- time.  Shown  below  is  a  graphical  illustration 
of  the  effect  of  motion  prediction  within  a  delayed-state  framework,  (a)  The  Markov  network 
for  a  segregated  collection  of  delayed-states.  The  view- based  “map”,  M,  is  composed  of  the  set 
M  =  {xe_4,Xf_3,x(_2,Xt_i},  which  is  a  collection  of  delayed-states  that  are  interlinked  by  camera 
constraints.  The  previous  and  predicted  robot  states,  xt  and  xi+1  respectively,  are  serially  linked 
to  the  map.  Below  the  Markov  network  is  a  schematic  showing  the  nonzero  structure  (colored  in 
gray)  of  the  associated  information  matrix,  (b)  Recalling  from  Table  3.1  the  expression  for  marginal¬ 
ization  of  a  Gaussian  in  information  form,  we  see  that  the  bottommost  schematic  illustrates  this 
operation  graphically.  The  end  result  is  that  only  the  states  that  were  linked  to  xt  (i.e.,  x*_i  and 
Xf+i)  ^  effected  by  the  marginalization  operation  as  indicated  by  the  cross-hairs  and  black  dots 
superimposed  on  AQQ. 
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corresponds  to  the  set  of  states  M  =  {x*_4,  x*_3,  x*_2,  x*_i},  which  have  an  interconnected 
dependence  due  to  camera  measurements  while  the  states  xt  and  x*+i  are  only  serially  con¬ 
nected  and  correspond  to  the  previous  and  predicted  robot  states  respectively.  Referring 
back  to  Table  3.1,  we  see  that  Fig.  4-2(b)  illustrates  the  effect  of  marginalization  on  the 
information  matrix.  We  note  that  since  x*  is  only  serially  connected  to  x*+i  and  x*_i, 
marginalizing  it  out  only  requires  modifying  the  information  blocks  associated  with  these 
elements  (i.e.,  AXt+lXfrM  and  AXt_lXt  l  shown  with  cross-hairs  and  the  symmetric  blocks 

Axt+^t-i  =  ^xT-ixt+i  shown  with  black  dots).  Therefore*,  since  only  a  fixed  portion  of  the 
information  matrix  is  ever  involved  in  the  calculation  of  (4.11),  motion  prediction  can  be 
performed  in  constant-time.  This  is  an  important  result  since  in  practice  the  fusion  of  asyn¬ 
chronous  navigation  sensor  measurements  (e.g.,  odometry,  compass)  implies  that  prediction 
is  typically  a  high-bandwidth  operation  (e.g.,  (9(10  Hz)  or  more). 

4.2.4  State  Recovery 

The  information  form  of  the  Gaussian  is  parameterized  by  its  information  vector  and  in¬ 
formation  matrix,  r\t  and  A*  respectively.  However,  the  expressions  for  motion  prediction 
(4.11)  and  measurement  update  (4.10)  additionally  require  sub-elements  from  the  state 
mean  vector,  /xt,  so  that  the  nonlinear  models  (4.2)  and  (4.7)  can  be  linearized.  Therefore, 
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in  order  for  the  information  form  to  be  a  computationally  efficient  parameterization  for 
delayed-states,  we  also  need  to  be  able  to  easily  recover  portions  of  the  state  mean  vector. 
Fortunately,  this  is  the  case  due  to  the  sparse  structure  of  the  information  matrix,  At. 


Full  State  Recovery 

Naive  recovery  of  our  state  estimate  through  matrix  inversion  results  in  cubic  complexity 
and  destroys  any  efficiency  gained  over  the  EKF.  Fortunately,  closer  inspection  reveals  that 
the  recovery  of  the  state  mean,  fjLt ,  can  be  posed  more  efficiently  as  solving  the  sparse, 
symmetric,  positive-definite,  linear  system  of  equations  shown  in  (4.12). 

Atfit  =  r)t  (4.12) 

Such  systems  can  be  solved  via  the  classic  iterative  method  of  conjugate  gradients  (CG)  [143]. 
In  general,  CG  can  solve  this  system  in  n  iterations  with  0(n)  cost  per  iteration  where  n 
is  the  size  of  the  state  vector  (i.e.,  0(n2)  total  cost),  and  typically  in  many  fewer  iterations 
if  the  initialization  is  good  [78].  In  addition,  since  the  state  mean,  [it,  typically  does  not 
change  significantly  with  each  measurement  update  (excluding  key  events  like  loop-closure), 
this  relaxation  can  take  place  over  multiple  time  steps  using  a  fixed  number  of  iterations 
per  update  as  pioneered  by  Duckett,  Marsland,  and  Shapiro  [33]  and  Thrun  et  al.  [160]. 
The  caveat  is  that  a  fixed  number  of  iterations  does  not  necessarily  guarantee  optimal  state 
recovery  [128]. 

Recently,  a  couple  of  newly  proposed  multilevel  relaxation  SLAM  algorithms  that  ap¬ 
pear  capable  of  solving  (4.12)  in  linear  asymptotic  complexity  (i.e.,  0(n ))  have  appeared 
in  the  literature.  These  new  state  recovery  techniques  by  Konolige  [78]  and  Frese,  Lars- 
son,  and  Duckett  [52]  achieve  the  dramatic  computational  reduction  by  subsampling  poses 
and  performing  the  relaxation  over  multiple  spatial  resolutions.  Borrowing  multigrid  re¬ 
laxation  techniques  pioneered  in  the  early  1970’s  for  solving  discretized  partial  differential 
equations  (PDEs)  [14],  the  key  idea  is  that  spatial  subsampling  improves  relaxation  conver¬ 
gence  rates.  Hence,  since  measurement  updates  (§4.2.2)  and  motion  prediction  (§4.2.3)  are 
both  constant-time  operations  that  depend  upon  the  solution  to  (4.12),  this  suggests  that 
ESDFs  are  at  most  0(n)  complexity  if  we  employ  the  relaxation  techniques  of  [52,78]. 


Partial  State  Recovery 

An  important  observation  regarding  the  expressions  for  motion  prediction  (4.11)  and  mea¬ 
surement  updates  (4.10)  is  that  they  only  require  knowing  subsets  of  the  state  mean  \it.  In 
light  of  this  we  note  that  rather  than  solving  for  the  complete  state  mean  vector,  we 
can  partition  (4.12)  into  two  sets  of  coupled  equations  as 


A  a 

A& 

w 

Ve 

A  « 
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(4.13) 


This  partitioning  of  nt  into  what  we  call  the  “local  portion”  of  the  map,  n(,  and  the 
“benign  portion”,  nb,  allows  us  to  sub-optimally  solve  for  the  local  portion  of  the  map  we 
are  interested  in  constant-time.  By  holding  our  current  estimate  for  nb  fixed,  we  can  solve 
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(4.13)  for  an  estimate  of  fie  as 


=  Aeel  (iff  -  AftAb)-  (4-14) 

Equation  (4.14)  provides  us  with  a  method  for  recovering  an  estimate  of  the  local  map, 
/x^,  provided  that  our  estimate  for  the  benign  portion,  fib,  is  a  decent  approximation  to 
the  actual  mean,  fib.  Furthermore,  note  that  only  a  subset  of  fib  is  actually  required  in  the 
calculation  of  corresponding  to  the  nonzero  elements  in  the  sparse  matrix  A#,.  In  terms  of 
Thrun  et  al.’s  notation  [160],  this  active  subset,  denoted  /x^,  represents  the  Markov  blanket 
of  and  corresponds  to  elements  that  are  directly  connected  to  /x^  in  the  associated  Markov 
network.  Therefore,  calculation  of  the  local  map,  /x^,  only  requires  an  estimate  of  the  locally 
connected  delayed-state  network,  /xjj-,  and  does  not  depend  upon  passive  elements  in  the 
benign  portion  of  the  map. 

In  particular,  (4.14)  provides  an  accurate  and  constant-time  approximation  for  recover¬ 
ing  the  robot  mean  during  motion  prediction,  and  during  incorporation  of  high  bandwidth 
navigation  sensor  measurements.  Since  the  robot  state  is  only  serially  connected  to  the 
map,  A#,  has  only  one  nonzero  block-element  (§4.2.3).  Therefore,  solving  for  the  robot 
mean  is  constant-time.  Note,  though,  that  (4.14)  will  only  provide  a  good  approximation 
so  long  as  the  active  mean  estimate,  /xjj",  is  accurate.  In  the  case  that  it  is  not  (e.g.,  as  a 
result  of  loop  closure),  then  the  true  full  mean,  /xt,  should  be  solved  for  via  (4.12). 


4.3  Consistent  Covariance  Recovery 

While  recovering  the  mean  is  a  vital  component  for  making  real-world  decisions  when  in¬ 
teracting  with  the  environment,  it  alone  is  not  always  sufficient.  For  example,  robotic  tasks 
such  as  motion  planning,  data  association,  and  loop-closing  usually  require  some  notion  of 
the  joint- uncertainty  between  the  state  estimates  (i.e.,  the  covariance  matrix).  Furthermore, 
estimates  of  how  “certain”  we  are  of  map  relations  can  have  imperative  implications  on  the 
action  of  the  robot  —  quoting  Uhlmann  [23]: 

An  autonomous  vehicle  controller,  for  example,  might  not  take  evasive  action 
in  response  to  an  estimate  that  places  the  mean  position  of  the  vehicle  at  the 
edge  of  the  road  and  an  uncertainty  of  only  one  centimeter.  But  if  the  same 
estimate  had  an  uncertainty  of  a  meter,  the  controller  would  likely  direct  the 
vehicle  toward  the  center  of  the  lane  to  avoid  the  worst  case  possibility  that  it 
is  actually  off  the  road. 

While  it  is  hard  to  define  a  single  definition  of  consistency  employed  uniformly  in  the 
prior  literature  on  SLAM,  intuitively  consistency  reflects  the  goal  that  the  error  estimates 
computed  by  the  filter  should  “match”  the  actual  errors.  In  relation  to  SLAM,  consistency  of 
the  error  estimates  is  important  for  data  association  —  determining  the  correspondences  for 
measurements  [118].  This  is  important  both  in  the  context  of  “local”  SLAM  (detecting  and 
tracking  features),  and  in  a  “global”  sense  (for  closing  loops).  If  the  SLAM  error  estimates 
are  too  small  (over-confident),  both  of  these  tasks  can  become  difficult,  as  will  be  shown  in 
§4.5.2. 
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Our  strategy  for  approximate  covariance  recovery  from  the  information  form  is  formu¬ 
lated  upon  gaining  efficient  access  to  meaningful  values  of  covariance  that  are  consistent 
with  respect  to  the  actual  covariances  obtained  by  matrix  inversion.  The  motivation  for  a 
consistent  approximation  is  that  we  guard  against  under-representing  the  uncertainty  asso¬ 
ciated  with  our  state  estimates,  which  otherwise  could  lead  to  data  association  and  robot 
planning  errors.  It  is  the  access  to  meaningful  values  of  joint-covariance  for  robot  interac¬ 
tion,  data  association,  and  decision  making  in  the  information  form  which  motivates  our 
discussion. 

4.3.1  Setting  the  Stage 
Naive  Covariance  Recovery 

The  covariance  matrix  corresponds  to  the  inverse  of  the  information  matrix,  however,  ac¬ 
tually  recovering  the  covariance  via  matrix  inversion  is  not  practical  since  this  is  a  cubic 
operation.  Additionally,  while  the  information  matrix  can  be  a  sparse  representation  for 
storage,  in  general,  its  inverse  results  in  a  fully  dense  covariance  matrix  despite  any  sparsity 
in  the  information  form  [51].  This  implies  that  calculating  the  covariance  matrix  requires 
quadratic  memory  storage  —  a  requirement  that  could  become  prohibitive  for  very  large 
maps  (e.g.,  maps  >  O(105)  state  elements).  To  illustrate  this  point,  for  the  10, 404  x  10, 404 
information  matrix  shown  in  Fig.  4-1,  storing  it  in  memory  only  requires  4.5  MB  of  double 
precision  storage  for  the  nonzero  elements  while  its  inverse  requires  over  865  MB. 

What  Do  We  Need? 

Fortunately,  recovering  the  full  covariance  matrix  usually  isn’t  necessary  for  SLAM  as  many 
of  the  data  association  and  robotic  planning  decisions  typically  do  not  require  the  entire  co- 
variance  matrix,  but  only  the  covariance  over  subsets  of  state  variables  [30].  Unfortunately, 
accessing  only  subsets  of  state  variables  in  the  information  form  is  not  an  easy  task.  As  we 
saw  in  §3.2.2,  the  covariance  and  information  representations  of  the  Gaussian  distribution 
lead  to  very  different  computational  characteristics  with  respect  to  the  fundamental  prob¬ 
abilistic  operations  of  marginalization  and  conditioning.3  In  particular,  marginalization  is 
easy  in  the  covariance  form  since  it  corresponds  to  extracting  the  appropriate  sub-block 
from  the  covariance  matrix  while  in  the  information  form  it  is  hard  because  it  involves  cal¬ 
culating  the  Schur  complement  over  the  variables  we  wish  to  keep.  Therefore,  even  though 
we  may  only  need  access  to  covariances  over  subsets  of  the  state  elements  [30]  (and  thus 
only  have  to  invert  a  small  information  matrix  related  to  the  subset  of  variables  we  are  in¬ 
terested  in),  accessing  them  in  the  information  form  requires  marginalizing  out  most  of  our 
state  vector  resulting  in  cubic  complexity  due  to  matrix  inversion  in  the  Schur  complement. 

Prior  Art 

To  get  around  this  dilemma  for  SEIFs,  Thrun  et  al.  [92, 160]  proposed  a  data  association 
strategy  based  upon  using  conditional  covariances.  Since  conditional  information  matrices 
are  easy  to  obtain  in  the  information  form  (simply  extract  a  sub-block  over  the  desired 

3See  Table  3.1,  pg.  87. 
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variables),  their  strategy  is  to  choose  an  appropriate  sub-block  from  the  information  matrix 
such  that  it’s  inverse  approximates  the  actual  covariance  for  the  subset  of  variables  they  are 
interested  in.  In  particular,  given  two  state  variables  of  interest,  x*  and  Xj,  their  approx¬ 
imation  selects  the  joint  Markov  blanket,  MtuM|  (i.e.,  represents  state  variables 
directly  connected  to  x^  in  a  graph  theoretic  sense  within  the  information  matrix),  and  ad¬ 
ditionally  if  the  intersection  is  null  (i.e.,  n  =  0),  variables  along  a  path  connecting 
x*  and  Xj  topologically.  Their  method  then  extracts  and  inverts  this  sub-block  to  obtain  a 
joint-covariance  matrix  for  x*  and  Xj  conditioned  on  all  other  variables  that  have  an  indirect 
influence.  They  note  that  empirical  testing  shows  that  their  approximation  method  seems 
to  work  well  in  practice  for  their  application  [92],  despite  the  fact  that  using  conditional 
covariances  should  result  in  an  over-confident  approximation. 

4.3.2  Consistent  Covariances  for  Data  Association 

In  this  section  we  outline  our  strategy  for  recovering  approximate  joint-covariances  use¬ 
ful  for  data  association.  Before  we  begin,  we  want  it  to  be  clear  to  the  reader  that  our 
technique  for  obtaining  and  maintaining  these  covariances  should  not  be  confused  with  the 
actual  updating  and  mechanics  of  the  information  parameterization.  What  we  present  in 
the  following  is  a  way  of  maintaining  covariance  bounds  that  are  consistent  with  respect  to 
the  information  parameterization.  Furthermore,  these  covariances  are  used  for  data  asso¬ 
ciation  only  and  are  not  in  any  way  involved  in  the  actual  update  and  maintenance  of  the 
information  filter  representation.  With  that  being  said  we  now  present  our  algorithm. 

Efficiently  Accessing  the  Robot’s  Covariance 

We  begin  by  noting  that  recovery  of  our  state  estimate,  /xt,  from  the  information  form 
already  requires  that  we  solve  the  sparse,  symmetric,  positive-definite  system  of  equations 
(4.12)  and  moreover  that  this  system  can  be  solved  in  0(n)  time  using  [52,78].  Our  co- 
variance  recovery  strategy  for  the  information  form  is  based  upon  augmenting  this  linear 
system  of  equations  so  that  the  current  robot  pose  covariance  is  accessible  as  well.  Note  that 
by  definition  =  I  and,  therefore,  by  picking  the  ith  basis  vector  e*  from  the  identity 
matrix  we  can  use  it  to  selectively  solve  for  a  column  of  the  covariance  matrix,  denoted  as 

AtEt  =  I  =>  A t£*i  =  et  where  I  =  [ei , . . . ,  en] 

To  obtain  the  robot’s  covariance  at  any  time  step  we  simply  augment  our  original  linear 
system  (4.12)  to  include  an  appropriate  set  of  basis  vectors  Er  =  {er  }  such  that  the  solution 
to  (4.15)  provides  access  to  our  current  state  and  the  robot’s  covariance-column. 

At  [/it  2.r]  =  [r/t  Er]  (4.15) 

Inserting  a  New  Map  Element 

Given  that  (4.15)  provides  a  mechanism  for  efficient  access  to  the  robot’s  covariance-column, 
E,r,  we  exploit  it  to  obtain  useful  covariance  bounds  for  other  map  elements.  For  example, 
whenever  we  insert  a  new  image  I{  into  our  view-based  map,  we  must  correspondingly  add 
a  new  element  x,  into  our  view-based  SLAM  state  vector  [36,38].  This  new  state  element 
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corresponds  to  a  sampling  of  our  robot  state  at  time  t{  (i.e.,  x*  =  xr(U))  and  represents  our 
estimate  of  where  the  robot  was  when  it  took  that  image.  Since  the  two  states  are  coincident 
at  time  ti,  the  covariance  for  x*  is  E^  =  Err  and  can  be  obtained  by  solving  (4.15).  A 
well-known  property  of  SLAM  is  that  over  time  the  covariance  for  Xi  will  decrease  as  new 
sensor  measurements  are  incorporated  and  all  map  elements  become  fully  correlated  [30]. 
Therefore,  storing  E^  =  E a  as  our  initial  approximate  covariance  estimate  for  Xi  serves  as 
a  conservative  bound  to  the  actual  marginal  covariance  for  all  time,  (i.e.,  E^  >  Eii(£)). 


Data  Association 


In  our  application,  the  joint-covariance  between  the  time-projected  robot  pose,  xr,  and  any 
other  map  entry,  x*,  i.e., 


^  joint  — 


is  needed  for  two  operations:  link  proposal  (§2.3.3)  and  pose-constrained  correspondence 
searches  (§2.4.3).  Link  proposal  corresponds  to  hypothesizing  which  images  in  our  view- 
based  map  could  potentially  share  common  overlap  with  the  current  image  being  viewed  by 
the  robot,  denoted  /r,  and,  therefore,  could  potentially  be  registered  to  generate  a  relative- 
pose  measurement.  The  second  operation,  pose-constrained  correspondence  searches,  uses 
the  relative-pose  estimate  between  candidate  images  /*  and  Ir  to  restrict  the  image-based 
correspondence  search  to  probable  regions  based  upon  a  two-view  point  transfer  relation.4 

To  obtain  the  actual  joint-covariance,  E joint,  from  the  information  form  requires  marginal¬ 
izing  out  all  other  elements  in  our  map  except  for  xr  and  Xi  and  leads  to  cubic  complexity 
in  the  number  of  eliminated  variables.  However,  we  can  obtain  a  bounded  approximation 
to  E joint  at  any  time-step  by  using  the  solution  from  (4.15)  to  provide  us  with  the  cur¬ 
rent  covariance-column,  E*r,  representing  the  joint-covariances  between  the  time- projected 
robot  and  all  other  map  entries  (note  that  this  solution  is  equivalent  to  what  could  be 
obtained  by  full  matrix  inversion  of  A*).  Using  this  result  we  can  construct  a  conservative 
joint-covariance  approximation  to  E joint  as 


(4.16) 


where  Err  and  E *r  are  extracted  from  E*r,  and  E^  is  our  conservative  covariance  bound 
for  Xi.  Note  that  (4.16)  represents  a  valid  positive-semidefinite  and,  therefore,  consistent 
approximation  satisfying 


^  joint  S  joint  — 

since  E^  —  E^  >  0.  Given  that  (4.16)  provides  a  consistent  approximation  to  the  true 
covariance,  we  can  use  it  to  compute  conservative  first-order  probabilities  of  relative-pose 
(i.e.,  xri  =  ©xr  ©  Xi)  for  link  hypothesis  and  correspondence  searches. 

4 Note  that  the  standard  maximum  likelihood  data  association  technique  for  feature-based  SLAM  also 
only  depends  on  extracting  ^ joint  [30]. 


0  0 
0  Eii  —  Eii 
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Updating  the  Covariance  Bounds 

Since  E^  serves  as  a  conservative  approximation  to  the  actual  covariance,  E a,  for  map 
element  x*,  we  would  like  to  be  able  to  place  tighter  bounds  on  it  as  we  gather  more  mea¬ 
surement  information.  In  fact,  the  careful  reader  will  recognize  that  our  SLAM  information 
filter  is  implicitly  already  doing  this  for  us,  however  the  issue  is  that  extracting  the  actual 
filter  bound,  E from  the  information  matrix  representation  is  not  particularly  convenient. 
Note  that  while  we  could  access  E^  by  solving  for  the  covariance-column  E*i  using  an  appro¬ 
priately  chosen  set  of  basis  vectors,  the  reason  for  not  doing  this  is  that  iteratively  solving 
systems  like  (4.15)  is  efficient  only  when  we  have  a  good  starting  point  [33,78].  In  other 
words,  when  we  solve  (4.15)  for  the  latest  state  and  robot  covariance-column,  our  estimates 
/xt  and  E*r  from  that  last  time-step  serve  as  good  seed  points  and,  therefore,  typically  only 
require  a  small  number  of  iterations  per  time-step  to  update  (excluding  loop-closing  events). 
In  the  case  of  solving  for  an  arbitrary  column,  E*»,  we  do  not  have  a  good  a  priori  starting 
point  and  thus  convergence  will  be  slower. 

Our  approach  for  tightening  the  bound  E a  is  to  use  our  joint-covariance  approximation 
(4.16)  and  perform  a  simple  constant-time  Kalman  update  on  a  per  re-observation  basis. 
In  other  words,  we  only  update  our  covariance  bound,  E^,  when  the  robot  re-observes  Xj 
and  successfully  generates  a  relative-pose  measurement,  z ri,  by  registering  images  I{  and 
Ir.  We  then  use  that  relative-pose  measurement  to  perform  an  EKF  update  (4.9)  on  the 
fixed  size  state  vector  y  =  [x7,x^]T  obtaining  the  new  conservative  bound  E+. 

Mathematically,  the  distribution  over  y  corresponds  to  marginalizing  out  all  elements 
in  our  state  vector  except  for  xr  and  xt  as 

p(y)  =  [  A/'_1(^t)At)dxJ  =  f  (4.17) 

^Xj^{xr,Xi}  ^Xj^{xr,Xi} 


which  results  in  the  distribution: 


p(y)  =  W( 
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Err  £J' 

fix. 

> 

Eir  Eii_ 

(4.18) 


Recalling  that  (4.16)  already  provides  us  with  a  consistent  approximation  to  this  distribu¬ 
tion,  we  have 


p(y)  =  W( 


Mr 

Srr 
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^ir 

fit. 
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.Eir 

ta 

(4.19) 


where  the  only  difference  between  the  actual  distribution  (4.18)  and  its  approximation 
(4.19)  is  the  conservative  marginal  Eii.  Using  the  measurement  zri  we  now  perform  a 
Kalman  update  (4.9)  on  (4.19)  yielding  the  conditional  distribution  p(y|zri)  from  which  we 
retain  only  the  updated  marginal  bound  E^  for  element  xt .  This  update  is  computed  in 
constant-time  for  each  re-observed  feature.  Algorithm  4.1  summarizes  the  procedure. 

Note  that  by  conceptually  performing  the  marginalization  step  of  (4.17)  before  com¬ 
puting  the  Kalman  update,  we  have  avoided  any  inconsistency  issues  associated  with  only 
storing  the  marginal  bounds,  Eii,  and  not  representing  the  intra-map  correlations.  This 
ensures  that  our  update  step  will  result  in  a  consistent  marginal  bound  for  data  association 
that  will  improve  over  time  as  we  re-observe  map  elements. 
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Algorithm  4.1  Calculation  of  marginal  covariance  bounds  used  for  data  association. 
Require:  E*r {initialize  bound} 

1:  if  Xj  =  new  map  element  then 
2:  store  E a  «—  Err 

3:  end  if 


Require:  /it,E*r{data  association  and  bound  update} 
4:  for  all  Xi  do 


5: 

6: 

7: 

8: 

9: 

10: 

11: 

12: 

13: 

14: 


E* 


joint 


Err  Eri 

_£ri  £ii_ 

compute  link  hypothesis  as  outlined  in  §2.3.3 
if  candidate  link  then 

do  constrained  correspondence  search  on  and  Ir  as  outlined  in  §2.4.3 
if  image  registration  success  then 

do  Kalman  update  on  E joint  using  measurement  zr{ 
store  Eit  <—  E+ 

end  if 
end  if 
end  for 


4.4  Discussion 

4.4.1  Connection  to  Lu-Milios 

The  concept  of  a  view-based  map  representation  has  strong  roots  going  back  to  a  seminal 
paper  by  Lu  and  Milios  [94].  Their  approach  sidestepped  difficulties  associated  with  feature 
segmentation  and  representation  by  doing  away  with  an  explicit  feature-based  parameter¬ 
ization  of  the  environment.  Rather,  their  technique  indirectly  represented  a  physical  map 
via  a  collection  of  global  robot  poses  and  raw  scan  data.  To  determine  the  global  poses,  they 
formulated  the  nonlinear  optimization  problem  as  one  of  estimating  a  set  of  global  robot 
poses  consistent  with  the  relative-pose  constraints  obtained  by  scan  matching  and  odom- 
etry.  They  then  solved  this  sparse  nonlinear  optimization  problem  in  an  batch- iterative 
fashion.  Our  ESDF  framework  essentially  attempts  to  recursively  solve  the  same  problem. 
Note,  though,  that  in  the  ESDF  framework  the  nonlinear  relative-pose  constraints  are  only 
linearized  once  about  the  current  state  when  the  measurement  is  incorporated  via  (4.10) 
while  in  the  noncausal  Lu-Milios  batch  formulation  they  are  re-linearized  around  the  current 
best  estimate  of  the  state  at  each  iteration  of  the  nonlinear  optimization.  This  implies  that 
while  the  ESDF  solution  can  be  performed  recursively,  it  will  be  more  prone  to  linearization 
error. 

4.4.2  Connection  to  Feature-Based  SLAM 

Another  interesting  theoretical  connection  involves  relating  the  delayed-state  SLAM  frame¬ 
work  to  feature-based  SLAM.  In  Chapter  3  we  saw  that  the  feature-based  SLAM  information 
matrix  is  naturally  dense  as  a  result  of  marginalizing  out  the  robot’s  trajectory.  On  a  sim- 
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Figure  4-3  ESDF’s  connection  to  feature-based  SLAM.  Conceptually,  view- based  SLAM  can  be 
viewed  as  marginalizing  out  the  landmarks  (i.e.,  Li,L2,La),  which  in  turn  causes  edges  to  appear 
between  spatially-local  samples  from  the  robot’s  trajectory. 


(a)  Landmark  and  trajectory  SLAM  posterior  as  a  Markov  network. 


(b)  Delayed-state  Markov  network  after  marginalizing  out  the  landmarks. 


ilar  train-of- thought,  conceptually  we  can  view  the  off-diagonal  elements  appearing  in  the 
delayed-state  SLAM  information  matrix  as  being  a  result  of  marginalizing  out  the  landmarks 
as  illustrated  by  Fig.  4-3.  Since  landmarks  are  only  ever  locally  observed,  they  only  create 
links  to  spatially  close  robot  states.  Therefore,  each  time  we  eliminate  a  landmark,  it  intro¬ 
duces  a  new  off-diagonal  entry  into  the  information  matrix  that  links  all  robot  states  that 
observed  that  landmark.  Interestingly,  this  same  type  of  constraint  phenomenon  also  ap¬ 
pears  in  photogrammetry  and  in  particular  in  large  scale  bundle  adjustment  techniques  [163]. 
These  techniques  are  based  upon  a  partitioned  Levenberg-Marquardt  algorithm  that  takes 
advantage  of  the  inherent  sparsity  between  camera  and  3D  feature  constraints  in  the  re¬ 
construction  problem.  Their  central  component  is  based  upon  eliminating  3D-structure 
equations  to  yield  a  coupled  set  of  equations  over  camera  poses  that  they  solve  and  then 
back-substitute  to  recover  the  associated  3D-structure.  Therefore,  loosely  speaking,  the 
delayed-state  information  framework  represents  a  recursive  linearized  formulation  of  this 
same  problem. 

4.4.3  Video  Frame  Rates 

Finally,  note  that  a  view-based  representation  is  still  applicable  even  with  much  higher 
frame  rates.  In  our  underwater  VAN  application,  a  digital-still  image  is  collected  every  few 
seconds  from  a  down-looking  monocular  camera.  Since  this  typically  results  in  temporal 
overlap  of  the  order  of  15-35%,  we  include  all  image  frames  into  our  view-based  map 
representation.  However,  in  the  general  case  where  much  higher  frame  rates  are  available, 
note  that  we  can  still  use  a  view-based  methodology  by  selectively  subsampling  key  frames 
from  our  video  sequence  to  serve  as  spatial  “anchor  points”  in  our  view-based  map.  Re¬ 
observation  of  these  key  frames  (coupled  with  successful  image  registration)  provides  a  zero- 
drift  spatial  measurement  of  robot  motion  allowing  us  to  “close-the-loop” .  Furthermore, 
we  can  exploit  the  higher  frame  rates  to  get  an  improved  estimate  of  visual  odometry 
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Figure  4-4  View-based  SLAM  with  video  frame  rates,  (a)  Our  collection  of  anchor  images 
{/40,  •  •  •  >  lAj }  represents  a  subsampling  of  the  available  video  image  sequence  and  serves  as  our 
view-based  spatial  map  M.  Given  higher- frame  rates,  though,  we  can  exploit  the  additional  views 
between  temporally  consecutive  anchor  images  I  a,  and  I  a k  to  get  an  improved  estimate  of  incre¬ 
mental  motion.  The  improved  motion  estimate  comes  from  a  local  bundle  adjustment  that  includes 
the  temporary  frame  set  T  =  {/t0»  •  •  • ,  /rm}-  (b)  The  result  is  a  serial  constraint  between  Ia3  and 
lAk  that  is  much  more  robust  than  a  single  pairwise  measurement  between  the  pair. 


(a)  The  temporary  frame  set  T  provides  additional  constraints. 


by  performing  a  local-bundle  adjustment  over  all  frames  occurring  between  temporally 
consecutive  anchor  images  (Fig.  4-4).  Similar  in  spirit  to  Zhang  and  Shan’s  visual  odometry 
[181],  the  local-bundle  adjustment  provides  an  improved  estimate  of  incremental  camera 
motion  because  it  incorporates  additional  constraints.  However,  by  also  maintaining  a 
collection  of  anchor  images  in  our  framework,  we  retain  the  ability  to  close-the-loop  (with 
order  0(n)  complexity). 

4.5  Results 

4.5.1  Experimental  Validation:  VAN  EKF  vs.  VAN  ESDF 
Experimented  Setup 

In  this  section  we  demonstrate  the  efficiency  of  the  ESDF  information  framework  as  com¬ 
pared  to  the  EKF-based  formulation  of  Chapter  2.  For  this  purpose,  we  use  the  JHU  dataset 
from  §2.5.2.  Note  that  all  processing  was  done  using  Matlab  R13  running  on  an  Intel 
1.8  GHz  Pentium-4M  laptop  with  1024  MB  of  RAM.  For  the  purposes  of  benchmark  com¬ 
parison,  we  employed  the  full  state  recovery  technique  of  (4.12)  after  every  camera  mea¬ 
surement  and  otherwise  used  the  constant-time  partial  state  recovery  method  of  (4.13)  to 
advance  the  robot  state.  To  briefly  recap,  the  experimental  setup  consisted  of  a  downward¬ 
looking  digital-still  camera  mounted  on  a  moving  underwater  pose  instrumented  ROV  at 
the  JHU  Hydrodynamic  Test  Facility.  Their  vehicle  is  instrumented  with  a  typical  suite  of 
oceanographic  dead-reckoning  navigation  sensors  capable  of  measuring  heading,  attitude, 
xyz  bottom-referenced  Doppler  velocities,  and  a  pressure  sensor  for  depth. 
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Figure  4-5  This  figure  contrasts  the  exact  sparsity  of  the  ESDF  information  matrix  versus  the  den¬ 
sity  of  the  full  covariance  matrix,  (a)  Spatial  topology  of  a  101  image  sequence  of  underwater  images 
collected  from  the  JHU  ROV  —  in  all  there  are  307  camera  constraints,  (b)  Nonzero  elements  of  the 
covariance  matrix;  all  elements  above  a  normalized  correlation  score  of  10%  are  shown,  (c)  Nonzero 
elements  of  the  information  matrix.  Note  that  the  covariance  matrix  has  12242  =  1,498, 176  nonzero 
elements  while  the  information  matrix  has  only  60,048.  The  covariance  matrix  and  information  ma¬ 
trix  are  numerically  equivalent,  however  the  information  matrix  is  exactly  sparse. 


(a)  VAN  recovered  trajectory  for  101  images  of  the  JHU  dataset.  (c)  Information  matrix. 


Experimental  Results 

Fig.  4-5  shows  the  result  of  estimating  the  ROV  delayed-states  associated  with  a  101  im¬ 
age  sequence  using  a  full  covariance  EKF  and  sparse  EIF.  For  this  experiment,  the  vehicle 
started  near  the  top-left  corner  of  the  plot  at  (-2.5,2.75)  and  then  drove  a  course  con¬ 
sisting  of  two  grid-based  surveys,  one  oriented  SW  to  NE  and  the  other  W  to  E.  The 
top  plot  shows  the  spatial  XY  pose  topology,  3-sigma  confidence  bounds,  and  link  net¬ 
work  of  camera  constraints.  Green  links  correspond  to  temporally  consecutive  images  that 
were  successfully  registered  while  red  links  correspond  to  spatially  registered  image  pairs. 
The  rightmost  plot  in  this  figure  compares  the  densities  associated  with  the  EKF  covari¬ 
ance  matrix  versus  the  ESDF  information  matrix.  Note  that  while  the  EKF  correlation 
matrix  is  dense,  the  information  matrix  exhibits  a  sparse  tridiagonal  structure  with  the 
number  of  off-diagonal  elements  being  linear  in  the  number  of  camera  measurements.  In 
all  there  are  307  camera  constraints  (81  temporal  /  226-spatial)  and  each  delayed-state 
is  a  12-vector  consisting  of  6-pose  and  6-kinematic  components.  Therefore,  102  delayed- 
states  (101  images  plus  the  robot)  results  in  a  1224  x  1224  information  matrix  containing 
122(102  +  2  •  101)  +  62(2  •  226)  =  60, 048  nonzero  elements  as  shown.  We  found  the  EKF  and 
ESDF  solutions  to  be  numerically  equivalent,  and  furthermore,  that  the  ESDF  only  required 
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Figure  4-6  Time  comparison  of  EKF  vs.  ESDF  filtering  operations  using  the  JHU  dataset,  (a)  The 
top  figure  shows  both  the  EKF  and  ESDF  prediction  times  in  seconds  versus  the  number  of  delayed- 
state  entries  while  the  bottom  figure  shows  their  pointwise  ratio.  From  the  plots  we  can  gather  that 
prediction  is  a  constant-time  operation  for  both  filters,  (b)  Same  plot  layout  as  before,  but  now  we 
show  the  update  times  for  each  filter.  For  benchmark  comparison  we  employed  the  full-state  recovery 
technique  of  (4.12)  after  every  camera  measurement  (using  Matlab’s  “left  divide”  capability).  Note 
that  despite  this,  the  bottom  figure  shows  that  as  the  number  of  delayed-state  elements  increases, 
the  ESDF  becomes  more  efficient  relative  to  the  EKF  due  to  the  decreasing  density  of  the  information 
matrix. 


Prediction  Time:  EKF  vs.  EIF  Update  Time:  EKF  vs.  EIF 


Pointwise  Ratio  of  EKF  Predict  to  EIF  Predict  Pointwise  Ratio  of  EKF  Update  to  EIF  Update 


(a)  EKF  vs.  ESDF  prediction  times. 


(b)  EKF  vs.  ESDF  update  times. 


4%  of  the  storage  of  the  EKF  for  this  experiment. 

Turning  our  attention  now  to  filter  efficiency,  in  Fig.  4-6  we  compare  the  prediction 
and  update  times  of  the  EKF  to  those  of  the  ESDF.  In  particular,  we  see  that  prediction 
is  essentially  a  constant-time  operation  for  both  filters.  However,  Fig.  4-6(b)  shows  that 
ESDF  updates  are  orders  of  magnitude  more  efficient  than  corresponding  EKF  updates,  and 
moreover  that  they  become  more  efficient  relative  to  the  EKF  as  the  number  of  delayed- 
states  increases.  This  increase  in  relative  efficiency  with  state  size  results  from  a  decreasing 
density  in  the  information  matrix.  Also,  note  that  this  impressive  computational  reduction 
is  despite  the  fact  that  we  are  using  Matlab’s  “left-divide”  capability  to  solve  (4.12) 
(essentially  a  form  of  Gaussian  elimination).  Hence,  the  ESDF’s  results  could  be  even  better 
if  we  implemented  the  iterative  multilevel  state  recovery  techniques  of  [52,78].  In  summary, 
for  this  101  image  sequence:  data  collection  took  a  total  of  17  minutes,  EKF  processing 
required  29  minutes,  and  ESDF  estimation  was  just  over  a  minute  (i.e.,  17x  faster  than 
real-time).5 


5 Excludes  image  registration  time. 
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Figure  4-7  Institute  for  Exploration  ROV  Hercules  and  its  sensor  characteristics  [24]. 


(a)  ROV  Hercules. 
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(b)  Hercules’  pose  sensor  characteristics. 


4.5.2  Real-World  Results:  RMS  Titanic 
Experimental  Setup 

This  section  presents  experimental  results  validating  both  the  large-area  scalability  of  our 
ESDF  framework  and  the  consistency  of  our  covariance  recovery  technique.  The  wreck  of 
the  RMS  Titanic  was  surveyed  during  the  summer  of  2004  by  the  deep-sea  ROV  Hercules  [24] 
(Fig.  4-7)  operated  by  the  Institute  for  Exploration  of  the  Mystic  Aquarium.  The  ROV  was 
equipped  with  a  standard  suite  of  oceanographic  dead-reckon  navigation  sensors  capable 
of  measuring  heading,  attitude,  altitude,  xyz  bottom-referenced  Doppler  velocities,  and  a 
pressure  sensor  for  depth.  In  addition,  the  vehicle  was  also  equipped  with  a  calibrated  stereo 
rig  consisting  of  two  downward-looking  12-bit  digital-still  cameras  that  collected  imagery  at 
a  rate  of  1  frame  every  8  seconds.  However,  the  results  being  presented  here  were  produced 
using  imagery  from  one  camera  only  —  the  purpose  of  this  self-imposed  restriction  to  a 
monocular  sequence  is  to  demonstrate  the  general  applicability  of  our  VAN  methodology. 

Fig.  4-8  provides  a  summary  of  the  survey  pattern  and  comparison  of  the  different  nav¬ 
igation  methods  used  for  localizing  the  vehicle.  For  real-time  control,  the  vehicle  integrated 
bottom-lock  Doppler  velocity  measurements  to  get  a  dead-reckoned  estimate  of  XY  position. 
Additionally,  ship-based  ultra-short-baseline  (USBL)  tracking  provided  range  and  bearing 
fixes  to  the  vehicle.  Since  the  wreck  lies  at  a  depth  of  approximately  3750  m,  the  large  ship- 
to-vehicle  moment  arm,  coupled  with  angular  error  in  the  USBL  bearing  measurements, 
resulted  in  an  almost  useless  measurement  of  vehicle  tracking  as  indicated  by  the  widely 
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Figure  4-8  Mapping  results  from  a  summer  of  2004  ROV  survey  of  the  RMS  Titanic,  (a)  XY  plot 
comparing  the  raw  dead- reckon  navigation  data  (brown),  ship-board  ultra-short  baseline  tracking 
(gray),  and  reconstructed  VAN  trajectory  (red),  (b)  A  photomosaic  of  the  RMS  Titanic  constructed 
from  over  700  digital  still  images.  Note  that  this  photomosaic  is  presented  for  visualization  purposes 
only  as  a  representation  of  the  data  that  serves  as  input  to  our  algorithm.  It  is  the  result  of 
semi-automatic  processing  with  manual  selection  of  a  number  of  common  scene  points  to  guide  the 
photomosaicking  process.  This  could  be  considered  as  a  form  of  benchmark  against  which  fully 
autonomous  processing  can  be  compared. 
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(a)  Comparison  of  the  different  navigation  results. 
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(b)  Photomosaic  of  the  RMS  Titanic  (courtesy  Hanumant  Singh,  WHOI). 
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distributed  scatter  of  fixes  in  Fig.  4-8(a). 

The  vehicle  survey  consisted  of  a  grid-based  trajectory  containing  both  temporal  and 
side-to-side  cross-track  overlap.  The  survey  started  at  midships  and  proceeded  towards 
the  stern.  Upon  reaching  the  aft  portion  of  the  wreck,  the  camera  was  turned  off  and  the 
vehicle  was  piloted  back  towards  the  starting  point.  During  its  return  trip,  the  vehicle  lost 
bottom-lock  Doppler  velocity  measurements  for  a  period  of  time,  and  therefore,  was  unable 
to  dead-reckon  integrate  its  vehicle  position  during  this  time  period  —  this  is  the  cause  of 
the  “discontinuity”  in  the  brown  trajectory  of  Fig.  4-8(a).  After  the  vehicle  returned  near 
its  starting  point,  the  camera  was  turned  back  on  and  the  vehicle  finished  off  the  survey  by 
mapping  the  bow  portion  of  the  wreck. 


Experimental  Results:  Scalability 

Fig.  4-8(a)  shows  our  post- processed  VAN  trajectory  overlaid  in  red  on  top  of  the  traditional 
oceanographic  navigation  results.  Note  that  this  result  was  computed  from  camera  and 
dead-reckon  sensors  only  and  does  not  in  anyway  use  the  USBL  ship-based  tracking.  This 
result  constitutes  a  significant  advancement  in  the  current  state-of-the-art  as  it  represents 
the  largest  visually  navigated  underwater  dataset  to  date.  In  support  of  this  claim,  note 
that  the  vehicle  traversed  a  2D  path  length  of  3.1  km,  and  a  3D  XYZ  path  length  of  3.4  km 
during  its  maneuvers  to  maintain  a  safe  altitude  between  it  and  the  wreck.  The  convex 
hull  of  the  final  mapped  region  encompasses  an  area  over  3100  m2  and  in  all  a  total  of  866 
images  were  used  to  provide  3494  camera-generated  relative-pose  constraints. 

In  Fig.  4-9(a)-(d)  we  see  a  time  progression  of  the  camera  constraints  and  vehicle  trajec¬ 
tory  estimate.  In  particular,  Fig.  4-9(c)  shows  a  large  loop-closing  event  where  the  vehicle 
successfully  re-localized  by  correctly  registering  4  image  pairs  out  of  64  hypothesized  candi¬ 
dates  after  having  lost  bottom-lock  Doppler  velocity  measurements  for  an  extended  period 
of  time.  Fig.  4-9(e)  shows  the  final  3D  pose-constraint  network  and  Fig.  4-10  shows  its  2D 
view.  While  there  is  no  ground-truth  for  this  dataset,  the  resulting  pose-network  qualita¬ 
tively  appears  to  be  consistent  in  that  the  recovered  vehicle  trajectory  forms  the  outline  of 
a  ship’s  hull.  To  quantitatively  corroborate  this  observation  we  pairwise  triangulated  scene 
structure  using  only  our  image-to-image  correspondences  and  VAN  estimated  vehicle  poses, 
the  results  axe  shown  in  Fig.  4-11  and  Fig.  4-12.  In  particular,  a  striking  feature  of  both 
figures  is  that  the  triangulated  scene  structure  exhibits  good  coherency  both  globally  and 
locally.  This  result  is  even  more  impressive  when  taking  into  consideration  the  fact  that 
VAN  does  not  explicitly  enforce  consistency  of  structure,  instead  only  consistency  of  poses. 
This  furthermore  adds  evidence  that  VAN’s  global  pose  estimates  axe  near  ideal.  As  an 
aside,  note  that  the  quality  of  VAN’s  results  suggests  that  it  can  be  used  as  a  recursive  scal¬ 
able  solution  to  large-area  structure-from-motion  since  the  estimated  pose  and  triangulated 
structure  should  provide  a  good  initialization  point  in  an  optimal  bundle-adjustment  step. 

Finally,  Fig.  4-13  provides  a  histogram  of  VAN’s  precision,  expressed  as  a  percentage  of 
distance  traveled,  for  the  resulting  pose-network.  For  comparison  purposes,  figures-of-merit 
in  the  literature  for  state-of-the-art  dead-reckoned  DVL  position  error  axe  1%  for  a  typical 
heading  reference  [16]  and  0.1%  or  better  when  combined  with  an  INS  and  FOG  [101]. 
Note  that  VAN’s  (DVL,  plus  FOG,  plus  camera)  filter  estimated  precision  is  (on  a  majority) 
better  than  0.005%  distance  traveled  for  this  dataset  (note  that  0.005%  over  3.1  km  equals 
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Figure  4-9  Time-evolution  of  the  RMS  Titanic  pose  constraint  network,  (a)-(d)  Time  progression 
with  3-sigma  bounds,  from  left  to  right:  images  1-200,  1-400,  1-600,  1-800.  Green  links  represent 
temporally  consecutive  registered  image  pairs  while  red  links  represent  spatially  registered  image 
pairs.  Note  the  large  loop-closing  event  that  occurs  in  (c)  when  the  vehicle  returns  to  the  bow  of 
the  ship  after  having  traveled  from  the  stern  with  the  camera  turned  off.  (e)  XYZ  view  of  the  final 
3D  pose-constraint  network  associated  with  using  866  images  to  provide  3494  camera  constraints; 
3-sigma  bounds  are  unviewable  at  this  scale.  Note  that  the  recovered  vehicle  poses  and  image 
correspondences  can  be  used  as  direct  inputs  into  a  standard  bundle  adjustment  algorithm  for 
structure  recovery. 
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(a)  1-200.  (b)  1-400.  (c)  1-600.  (d)  1-800. 


Survey  Summary 

•  866  digital-still  images 

•  3494  camera  constraints 

•  Path  length:  2D  3.1  km  /  3D  3.4  km 

•  Convex  hull  of  the  mapped  area  >  3100  m2 
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Figure  4-10  The  recovered  XY  trajectory  and  covariance  bounds  for  the  RMS  Titanic,  (a)  Final 
XY  pose  constraint-network  associated  with  using  866  images  to  provide  3494  camera  constraints; 
3-sigma  bounds  are  shown,  (b)  A  zoomed  view  illustrating  the  consistency  of  the  data  association 
bounds  generated  by  our  algorithm.  Note  that  for  this  plot  the  3-sigma  bounds  have  been  inflated 
by  a  factor  of  30  for  visualization.  In  this  plot  we  have:  1)  the  initial  covariance  bounds  associated 
with  pose  insertion  (red),  2)  the  marginal  covariance  bounds  based  upon  constant-time  Kalman 
updates  (gray),  and  3)  the  actual  marginal  covariance  bounds  obtained  by  matrix  inversion  (green). 
Note  that  all  of  the  estimated  bounds  (gray)  were  verified  to  be  consistent  with  respect  to  the  actual 
marginal  covariance  (green)  by  performing  Cholesky  decomposition  on  their  difference  to  establish 
positive  definiteness.  The  reason  why  some  of  the  estimated  covariance  bounds  (gray)  are  tighter 
approximations  than  others  to  the  actual  filter  bounds  (green)  is  because  our  algorithm  only  updates 
the  bounds  on  a  per  re-observation  basis.  Hence,  if  the  robot  is  sufficiently  well- localized  when  it 
re-observes  an  image,  then  the  covariance  bound  of  the  corresponding  delayed-state  improves. 
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(a)  2D  view  of  the  final  pose  constraint  network. 
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Figure  4-11  The  triangulated  structure  for  the  RMS  Titanic  as  computed  from  the  final  VAN  pose 
estimate  (Fig.  4-9(e))  and  saved  pairwise  correspondences.  Triangulated  3D  points  are  defined  as 
the  midpoint  of  the  minimum  perpendicular  distance  between  two  corresponding  camera  rays.  Since 
structure  is  triangulated  on  a  pairwise  basis,  redundant  3D  points  may  occur,  (a)  A  histogram  of  the 
triangulation  error  (i.e.,  the  minimum  perpendicular  distance)  for  ail  points  across  all  established 
camera  pairs.  Note  that  the  histogram  contains  two  measures  and  that  its  y-axis  has  been  clipped 
to  show  fine  detail.  The  first  measure  (white)  is  the  triangulation  error  based  upon  the  relative-pose 
camera  measurements  used  by  the  ESDF  filter.  This  should  serve  as  a  baseline  for  the  best  possible 
pairwise  triangulation  error  since  each  pose  measure  is  the  product  of  a  two- view  bundle  adjust¬ 
ment.  The  second  measure  (black)  is  the  triangulation  error  based  upon  the  final  VAN  estimated 
poses.  Scale  for  both  measures  has  been  set  by  the  VAN  estimate.  Note  that  the  VAN  triangulated 
errors  are  more  widely  distributed  than  the  pairwise  bundle- adjusted  poses.  However,  this  is  to  be 
expected  since  VAN’s  global  estimate  takes  into  account  all  measured  camera  constraints,  (b)  The 
VAN  triangulated  points  rendered  in  3D  (467,512  points  in  total).  The  “outliers”  are  due  to  poor 
triangulation  resulting  from  residual  error  in  the  global  VAN  estimate.  Again,  this  error  is  to  be 
expected  since  VAN  is  not  directly  enforcing  structure  consistency,  only  pose  consistency.  In  fact, 
because  VAN  is  enforcing  only  pose  consistency,  the  overall  coherence  of  the  point  cloud  corrobo¬ 
rates  the  global  consistency  of  VAN’s  pose  estimates,  (c)-(d)  These  plots  display  the  same  data  as 
(a)-(b),  but  for  a  reduced  set  of  triangulated  data  (363,799  points).  This  reduced  set  corresponds 
to  throwing  away  all  points  having  a  triangulation  error  greater  than  7.5  cm. 
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(d)  Reduced  set  point  cloud. 
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Figure  4-12  The  triangulated  point  cloud,  resulting  Delaunay  surface,  and  texture  mapped  render¬ 
ing  for  the  RMS  Titanic,  (a)  The  (reduced  set)  triangulated  point  cloud  calculated  using  VAN  pose 
estimates  and  pairwise  correspondences  (same  color  scale  as  in  Fig.  4-11).  Overlaid  in  black  are  the 
tracklines  connecting  sequential  poses,  (b)  The  resulting  Delaunay  triangulated  surface,  (c)  The 
textured  mapped  surface  as  computed  by  back-projecting  the  images  onto  the  Delaunay  mesh  (the 
tiling  artifact  is  due  to  the  simple  overlay  of  images  without  blending).  The  red  regions  are  places 
where  no  camera  footprints  back-project  to  the  mesh.  In  particular,  the  red  strips  along  the  bow 
correspond  to  missing  cross-track  camera  constraints  (due  to  low  overlap)  in  the  final  pose-network 
(Fig.  4-10).  Note  that  these  regions  really  should  have  texture,  but  that  due  to  a  lack  of  camera  con¬ 
straints  there  is  residual  error  in  this  portion  of  the  pose-network  (however,  the  uncertainty  ellipses 
of  Fig.  4-10  reflect  this). 


(a)  Overhead  view  of  the  triangulated  point  cloud  and  tracklines. 


(b)  Overhead  and  side  view  of  the  Delaunay  surface. 
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Figure  4-13  A  histogram  of  VAN’s  estimated  precision,  expressed  as  a  percentage  of  distance 
traveled,  for  the  RMS  Titanic  pose-network. 
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15.5  cm).  While  the  actual  face  value  of  this  number  must  be  taken  with  a  grain  of  salt 
(since  there  is  no  ground-truth  with  which  to  verify  the  filter’s  consistency),  it  does  stress  the 
point  that  closed-loop  camera  feedback  provides  a  robust  and  improved  navigation  estimate 
even  despite  persistent  dropouts  in  the  bottom-lock  Doppler  measurement. 

Experimental  Results:  Covariance  Recovery 

Fig.  4-14  provides  a  quantitative  assessment  comparing  the  covariance  bounds  obtained 
by  our  algorithm  to  the  bounds  obtained  by  inverting  only  the  Markov  blanket  (§4.3.1). 
To  provide  a  fair  assessment,  we  choose  to  evaluate  the  relative  uncertainty  between  the 
robot,  xr,  and  any  other  map  element,  Xi.  Our  justification  for  this  metric  is  that  the 
Markov  blanket  results  in  a  conditional  covariance  that  does  not  accurately  reflect  global 
map  uncertainty,  but  rather  relative  map  uncertainty.  Using  the  information  matrix  of 
Fig.  4-1,  for  each  map  element,  x,,  we  computed  the  first-order  relative-pose  covariance 
matrix  between  it  and  the  robot.  For  our  metric,  we  chose  to  compute  the  log  of  the 
determinant  of  the  approximation  covariance  to  the  determinant  of  the  actual  covariances 
obtained  by  matrix  inversion.  Hence,  ratios  greater  than  one  (conservative)  are  positive, 
and  ratios  less  than  one  (overconfident)  are  negative.  Note  that  Fig.  4-14  highlights  that 
our  method  is  conservative  while  the  Markov  blanket  is  overconfident.  Furthermore,  for 
this  dataset  the  histogram  shows  that  our  method  tends  to  be  conservative  by  a  smaller 
margin. 

Finally,  Fig.  4-15  and  Fig.  4-16  both  demonstrate  the  actual  value  of  this  conservative 
approximation  within  the  context  of  pose-constrained  correspondence  searches.  In  partic¬ 
ular,  each  figure  shows  a  candidate  pair  of  underwater  images  and  the  predicted  epipolar 
geometry  instantiated  from  our  state  estimate.  Recall  that  for  a  calibrated  camera,  the 
epipolar  geometry  is  specified  by  the  relative  camera  pose  and  defines  a  ID  search  con¬ 
straint  [64].  However,  when  the  relative-pose  estimate  is  uncertain,  this  ID  search  con¬ 
straint  becomes  a  search  region  (§2.4.3).  Fig.  4-15  depicts  a  case  where  the  Markov  blanket 
approximation  fails  due  to  its  overconfident  covariance  estimate.  This  failure  is  indicated 
by  the  fact  that  its  6-sigma  confidence  search  region  does  not  contain  the  true  correspon- 
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Figure  4-14  A  quantitative  comparison  of  the  different  covariance  recovery  techniques  using  the 
information  matrix  of  Fig.  4-1.  These  plots  compare  the  Markov  blanket  covariance  approximation 
to  the  results  of  our  method,  both  of  which  are  shown  relative  to  the  actual  covariance  obtained  by 
matrix  inversion.  For  each  method  and  state  entry  x»,  we  compute  its  relative-pose  to  the  robot  xr 
(i.e.,  xri  =  ©xr  ®  Xj)  and  associated  first-order  covariance.  We  then  plot  the  log  of  the  ratio  of  the 
determinant  of  the  approximated  covariance  to  the  determinant  of  the  actual  covariance  to  facilitate 
comparison  of  conservativeness  (positive  values)  versus  overconfidence  (negative  values),  (a)  Plot  of 
the  log  ratio  verses  feature  id  for  all  x*.  Note  that  a  value  of  zero  is  ideal  as  this  would  indicate  a 
ratio  of  one.  (b)  Same  data  as  above  but  presented  in  histogram  form.  Both  plots  show  that  the 
method  presented  in  this  paper  is  conservative  while  the  Markov  blanket  method  is  overconfident. 
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dence  while,  on  the  other  hand,  the  regions  computed  by  the  actual  covariance  and  our 
conservative  approximation  both  do.  Hence,  for  this  image  pair,  putative  correspondence 
establishment  fails  under  the  Markov  blanket  approximation,  but  meanwhile  succeeds  with 
ours.  Additionally,  Fig.  4-16  highlights  that  the  amount  of  overconfidence  in  the  Markov 
blanket  approximation  is  unpredictable,  since  for  a  different  image  pair  it  produces  compa¬ 
rable  results. 

4.6  Chapter  Summary 

In  conclusion,  this  chapter  showed  that  the  delayed-state  view-based  SLAM  information 
matrix  is  exactly  sparse  and,  furthermore,  that  this  sparsity  is  a  direct  consequence  of  the 
process  model’s  Markovity.  Moreover,  while  the  covariance  formulation  requires  quadratic 
storage,  the  number  of  nonzero  off-diagonal  elements  in  the  ESDF  information  matrix  is 
linear  in  the  number  of  measured  relative-pose  constraints.  This  sparse  matrix  structure 
allows  for  O(n)  full  state  recovery  via  recently  proposed  multilevel  relaxation  methods, 
while  approximate  partial  state  recovery  allows  motion  prediction  and  navigation  updates 
to  be  performed  in  constant  time.  Additionally,  we  also  presented  a  novel  algorithm  for 
extracting  consistent  marginal  covariance  bounds  from  SLAM  information  filters.  These 
bounds  provide  a  conservative  covariance  approximation  useful  for  real-world  tasks  such 
as  image  link  hypothesis  and  pose-constrained  correspondence  searches.  Furthermore,  the 
technique’s  complexity  scales  asymptotically  linear  with  map  size.  Finally,  as  a  crowning 
achievement,  we  demonstrated  results  for  the  largest  underwater  visually  navigated  dataset 
to  date  using  data  collected  from  the  RMS  Titanic. 
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Figure  4-15  Performance  of  the  different  covariance  recovery  techniques  within  the  context  of  image 
registration.  In  this  example,  the  Markov  blanket  approximation  fails,  while  both  the  actual  covari¬ 
ance  and  our  conservative  approximation  succeed,  (a)-(b)  These  images  are  a  proposed  candidate 
pair  for  image  registration.  Image  (a)  represents  the  query  image  as  viewed  by  the  robot,  and  overlaid 
on  top  is  the  predicted  epipolar  geometry  (green)  instantiated  from  our  state  estimate.  Image  (b)  is 
the  proposed  candidate  for  image  registration,  and  overlaid  on  top  are  the  pose-constrained  corre¬ 
spondence  search  regions  for  6-sigma  confidence  bounds.  The  different  colored  regions  correspond  to 
the  three  covariance  recovery  methods  presented  in  this  chapter:  1)  our  conservative  method  (blue), 
2)  the  actual  covariance  based  upon  inverting  the  information  matrix  (yellow),  and  3)  the  Markov 
blanket  technique  (red),  (c)  These  images  show  a  zoomed  view  of  the  true  correspondence  (indicated 
by  the  white  arrows  in  (a)-(c)).  Careful  inspection  reveals  that  the  Markov  blanket  search  region 
(red)  does  not  contain  the  true  correspondence.  In  contrast,  both  the  actual  covariance  (yellow)  and 
our  covariance  approximation  (blue)  do. 


(a)  Query  image  and  its  epipolar  geometry. 


(b)  Candidate  image  and  its  search  regions. 
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Figure  4-16  This  figure  depicts  the  same  demonstration  as  Fig.  4-15,  but  for  a  different  image  pair. 
In  this  example,  all  three  methods  produce  comparable  results.  This  highlights  the  unpredictable 
nature  of  the  Markov  blanket  approximation. 


(a)  Query  image  and  its  epipolar  geometry. 


(b)  Candidate  image  and  its  search  regions. 
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CHAPTER  5 _ | 

_ Conclusions 


IN  pursuing  a  VAN  methodology,  this  thesis  has  advanced  the  current  state-of-the-art  in 
large-area  underwater  visual  navigation.  As  evidence  for  this  claim,  we  demonstrated 
.  successful  automatic  end-to-end  processing  for  the  largest  visually  navigated  underwater 
dataset  to  date  using  data  collected  from  the  RMS  Titanic  (866  images,  survey  path  length 
over  3  km,  and  3100  m2  of  mapped  area). 

5.1  Contributions 

This  thesis  has  made  general  contributions  to  the  understanding  of  scalable  SLAM  algo¬ 
rithms  and  systems-level  computer  vision.  In  particular,  the  contributions  of  this  thesis 
are: 

1.  We  presented  a  systems-level,  vision-based ,  6-DOF  SLAM  framework  (Chapter  2)  that 
exploits  the  additional  sensor  capabilities  of  a  calibrated  pose-instrumented  platform. 

This  top-down  systems-level  approach  allowed  us  to  overcome  many  of  the  challenging 
peculiarities  associated  with  an  underwater  environment  (e.g.,  unstructured  natural 
terrain,  low-overlap  imagery,  moving  light  source)  by  exploiting  a  priori  platform  in¬ 
formation  wherever  possible  (e.g.,  bounded  error  attitude  measurements,  scene  depth 
constraints  from  altimeter).  Additionally,  we  showed  how  pairwise  relative-pose  mea¬ 
surements  can  be  recursively  incorporated  and  fused  with  navigation  data  within  a 
delayed-state  EKF  SLAM  framework. 

2.  We  developed  a  novel  pose-constrained  correspondence  search  (Chapter  2)  that  incor¬ 
porates  a  priori  relative-pose  information  (e.g.,  sensor-based  or  posterior-based)  to 
restrict  the  correspondence  search  to  probable  regions  between  hypothesized  candidate 
pairs. 

This  search  constraint  is  based  upon  a  two-view  point  transfer  relation  that  incorpo¬ 
rates  constraints  on  both  relative-pose  and  scene  depth  (e.g.,  from  an  altimeter).  This 
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greatly  improves  the  robustness  of  putative  matching  within  a  feature-based  image 
registration  methodology. 

3.  We  presented  a  theoretical  investigation  (Chapter  3)  into  the  constant-time  sparsifica- 
tion  approximation  employed  by  sparse  extended  information  filters  for  feature- based 
SLAM . 

In  particular,  we  offered  novel  insight  into  the  effect  sparsification  has  on  the  consis¬ 
tency  of  the  SLAM  information  posterior.  Additionally,  we  also  derived  a  modified 
sparsification  rule  that  maintains  sparsity  while  yielding  results  comparable  to  the 
standard  full-covariance  EKF  approach.  However,  this  accuracy  comes  at  the  loss  of 
constant-time  computation. 

4.  We  presented  the  novel  insight  that  a  view-based  SLAM  framework  retains  exact  spar¬ 
sity  when  posed  in  the  information  form  ( Chapter  4 )  and,  therefore,  can  exploit  the 
sparse  posterior  representation  in  a  consistent  way. 

This  result  lead  to  the  development  of  exactly  sparse  delayed-state  filters  as  a  con¬ 
sistent,  scalable,  recursive,  fusion  framework  for  incorporating  relative-pose  measure¬ 
ments.  Furthermore,  ESDFs  require  only  linear  storage  and  at  most  0(n)  complexity 
per  update  where  n  is  the  number  of  map  images.  Additionally,  if  relaxation  is  per¬ 
formed  over  multiple  time-steps,  this  complexity  can  be  traded  off  for  slightly  increased 
state  recovery  error. 

5.  Lastly,  we  developed  a  novel  algorithm  (Chapter  4)  for  efficiently  accessing  and  main¬ 
taining  consistent  covariance  bounds  within  a  SLAM  information  filter  thereby  greatly 
increasing  the  reliability  of  data  association. 

The  foundation  of  this  algorithm  is  the  solution  of  a  sparse  linear  system  coupled 
with  the  application  of  constant-time  Kalman  updates.  Its  output  is  a  consistent  set 
of  marginal  covariances  estimates  suitable  for  robot  planning  and  data  association. 

5.2  Failure  Modes 

While  within  the  context  of  this  thesis  we  have  demonstrated  that  our  large-area  VAN 
framework  has  significant  advantages  over  traditional  underwater  localization  methods,  we 
are  compelled  to  also  point  out  what  we  think  are  the  weak  points  of  this  framework.  In 
particular,  we  believe  failure  modes  of  VAN  to  be: 

1.  False  positive  image  registrations  and/or  repetitive  scene  structure. 

VAN’s  closed-loop  navigation  feedback  is  derived  from  the  registration  of  image  pairs 
with  common  scene  overlap.  Since  these  pairwise  camera  measurements  are  fed  to 
the  delayed-state  filter  as  a  measurement  of  relative-pose  modulo  scale,  a  false  posi¬ 
tive  camera  measurement  could  have  devastating  consequences  for  the  updated  pose- 
network  topology.  In  particular,  a  false  image  registration  could  cause  the  state  esti¬ 
mate  to  converge  to  an  incorrect  trajectory.  Though  utilizing  a  robust  image  registra¬ 
tion  methodology  reduces  the  likelihood  of  a  false  positive  within  the  VAN  framework, 
it  does  not  protect  against  an  environment  with  repetitive  scene  structure. 
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2.  Linearization  error  and  filter  divergence. 

Recall  that  VAN  represents  the  posterior  distribution  using  its  first  two  moments 
under  a  first-order  linearized  approximation  evaluated  about  the  current  state  mean. 
The  danger  with  this  approach  is  that  if  the  current  estimate  “drifts  too  far  from  the 
truth”,  then  the  linearization  point  in  state  space  may  no  longer  be  valid  and  could 
ultimately  lead  to  filter  divergence.  One  method  for  avoiding  this  condition  is  to  keep 
linearization  error  small  by  maintaining  good  map  contact  so  that,  in  practice,  the 
state  estimate  never  drifts  too  far  from  the  truth  (for  example,  typical  grid-based 
AUVs  surveys  achieve  this). 

5.3  Future  Work 

This  thesis  has  laid  the  foundation  for  a  promising  infrastructure-free,  near-seafloor,  naviga¬ 
tion  strategy  complementing  the  low-overlap  exploratory  surveys  typical  of  AUVs.  However, 
as  always,  there  is  room  for  additional  improvement.  Further  areas  of  research  that  we  have 
identified  as  warranting  future  investigation  are: 

1.  Develop  a  fast  lookup-table  method  for  initial  screening  of  hypothesized  image  pairs. 

Since  we  are  using  a  view-based  framework,  raw  images  must  be  registered  to  ex¬ 
tract  camera  motion.  For  this  purpose,  potential  overlapping  candidate  pairs  are  first 
proposed  based  upon  probable  spatial  proximity,  and  then  verified  by  attempting  ro¬ 
bust  image  registration  (e.g.,  RANSAC,  LMedS).  The  limitation  of  this  strategy  is 
that  pairwise  image  registration  is  arguably  the  slowest  component  of  the  VAN  frame¬ 
work.  This  suggests  that  we  can  potentially  gain  an  increase  in  efficiency  by  being 
more  selective  in  the  candidate  image  pair  proposal  stage.  Ideally,  characteristics  the 
pre-screening  method  should  posses  are: 

•  have  a  near-zero  probability  of  missed  detection  so  that  overlapping  candidate 
image  pairs  are  not  passed  over  for  attempted  registration, 

•  have  a  low  false  alarm  rate  so  that  non-overlapping  candidate  image  pairs  are 
not  recommended  for  attempted  registration, 

•  be  computationally  cheap. 

As  a  possible  suggestion,  the  method  could  employ  a  look-up  table  strategy  computed 
on  feature  point  projective  invariants  (e.g.,  cross-ratios  [64]). 

2.  Develop  a  metric  for  computing  the  quality  of  feature  content  within  an  image. 

As  we  saw  with  the  Stellwagen  Bank  dataset  of  Chapter  2,  seafloor  topography  can 
vary  widely  even  within  a  single  survey  area  (e.g.,  from  a  featureless  muddy  bottom 
to  large  boulders).  Hence,  the  question  is  “should  all  images  be  considered  equal 
for  inclusion  in  our  delayed-state  vector?”  For  re-localization  purposes  we  argue  no. 
While  we  can  use  our  navigation  sensors  to  give  us  a  good  prior  for  registering  se¬ 
quential  images  of  say  a  featureless  sandy  bottom,  this  type  of  imagery  does  little  for 
us  in  terms  of  re-localizing  the  robot  after  a  large  survey  loop.  Therefore,  it  would 
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be  beneficial  to  develop  a  metric  that  distinguishes  images  based  upon  their  feature 
content  so  that  we  can  avoid  including  “featureless”  images  into  our  view-based  map. 

3.  Extend  the  concept  behind  pairwise  pose- constrained  correspondence  searches  to  work 
with  multiple  image  candidate  sets. 

Loop-closing  image  registration  is  one  of  the  most  difficult  challenges  of  the  VAN 
framework  (any  SLAM  framework  really).  When  a  good  prior  exists  for  the  camera 
motion,  we  showed  in  §2.4.3  how  to  exploit  a  pairwise  pose-constrained  correspon¬ 
dence  search  to  greatly  improve  the  robustness  of  putative  correspondence  selection. 
Unfortunately,  when  closing  a  very  large  loop,  the  accumulated  uncertainty  implies 
that  our  global  pose  prior  will  essentially  be  useless  on  a  pairwise  basis  for  restricting 
the  putative  search  space.  However,  if  we  consider  extending  the  putative  match¬ 
ing  to  collections  of  images,  then  we  can  instead  perform  a  joint  putative  matching 
across  sets  that  exploits  the  fact  that  our  relative-pose  prior  is  well  known  intra-set. 
The  main  idea  would  be  to  look  for  matches  that  are  mutually  consistent  across  the 
two  sets  (similar  to  the  joint  compatibility  data  association  strategy  of  Neria  and 
Tardos  [118]).  This  should  increase  the  “signal- to-noise  ratio”  for  putative  matching 
since  feature  similarity  measures  could  be  considered  in  aggregate  as  illustrated  by 
Fig.  5-1. 

4.  Extend  the  information  filter  covariance  recovery  algorithm  to  be  less  conservative. 

Our  covariance  recovery  algorithm  of  §4.3.2  for  SLAM  information  filters  guarantees 
that  the  marginal  covariance  estimates  are  consistent  with  respect  to  the  actual  covari¬ 
ances  obtained  by  matrix  inversion.  However,  because  we  only  update  our  covariance 
bounds  on  a  per  re-observation  basis,  they  can  become  very  conservative  if  we  go 
for  long  periods  without  re-observing.  For  example,  the  worst  cast  scenario  occurs 
when  closing  a  very  large  loop  since  only  the  bounds  within  the  viewable  vicinity 
are  updated.  Hence,  this  suggests  that  it  would  be  beneficial  to  extend  our  update 
strategy  to  include  more  than  just  the  current  state  under  view  by  propagating  the 
observation  knowledge  throughout  the  constraint  network.  Unfortunately,  not  having 
access  to  the  cross-covariances  complicates  things.  As  a  possible  suggestion,  maybe 
a  hybrid  strategy  using  covariance  intersection  and  the  fixed-size  Kalman  updates  of 
§4.3.2  would  be  fruitful? 

5.  Lastly,  develop  a  way  for  not  having  to  add  additional  robot  poses  to  the  ESDF  state 
vector  once  we’ve  collected  enough  views  to  sufficiently  characterize  an  area. 

One  criticism  of  the  ESDF  view-based  framework  is  that  exact  sparsity  in  the  infor¬ 
mation  matrix  can  only  be  achieved  by  perpetually  adding  robot  poses  over  time.  For 
example,  consider  the  case  where  we’ve  collected  enough  “views”  to  sufficiently  char¬ 
acterize  an  area.  In  this  scenario,  when  we  return  to  a  previously  mapped  area,  rather 
th^n  adding  more  states,  instead  we  should  be  able  to  just  localize  with  respect  to 
the  finite  collection  we  already  have.  However,  the  problem  with  this  strategy  is  that 
just  like  in  feature-based  SLAM,  if  we  begin  to  marginalize  out  our  robot  trajectory 
the  information  matrix  will  densify  as  illustrated  by  Fig.  5-2.  Therefore,  if  we  want 
to  restrict  our  representation  to  environment  size  (fixed)  and  not  time  (unbounded), 
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Figure  5-1  A  multi-image  joint-correspondence  search,  (a)  The  pairwise  pose-constrained  corre¬ 

spondence  search  of  §2.4.3  exploits  the  relative-pose  prior  Xy  between  images  /,  and  Ij  to  reduce  the 
putative  correspondence  search  space,  (b)  For  loop-closing,  the  accumulated  uncertainty  along  the 
loop  may  render  our  global  pose  prior  useless.  However,  we  should  still  be  able  to  exploit  the  fact 
that  our  relative-pose  information  is  accurate  for  small  sequences  of  images.  For  example,  given  the 
overlapping  collection  of  images  on  the  left,  the  relative-pose  information  x„,  xjfc,  and  xfc<  should 
allow  us  to  extend  our  correspondence  search  across  multiple  raw  frames  in  our  view-based  map  as 
shown  on  the  right.  In  other  words,  instead  of  just  relying  upon  the  pairwise  discriminatory  power 
of  our  feature  descriptor,  we  can  use  our  relative-pose  information  to  look  for  mutually  consistent 
putative  matches  across  collections  of  images  by  exploiting  the  knowledge  that  if  It  and  Ie  overlap, 
then  so  should  Ik  and  /<*.  Effectively,  this  should  increase  the  “signal- to- noise  ratio”  of  our  feature 
similarity  measure  as  putative  matches  would  be  considered  jointly  across  images  and  checked  for 
mutual  consistency.  This  should  in  turn  make  visually-based  loop-closing  both  more  tractable  and 
more  robust. 
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then  this  suggests  that  some  sort  of  approximation  is  required  (similar  to  the  prun¬ 
ing  strategies  employed  by  feature-based  SLAM  information  filters).  Hence,  further 
research  into  a  consistent,  general,  computationally  efficient,  edge  pruning  algorithm 
may  be  fruitful  for  ail  areas  of  SLAM. 
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Figure  5-2  Localizing  in  a  previously  mapped  area  fills  in  the  ESDF  information  matrix  unless  we 
continue  to  add  robot  poses  to  our  representation,  (a)  Suppose  our  view-based  map  consists  of  the 
set  of  poses  M  =  {xo>x1,x2,x3}  and  the  robot,  xr,  returns  to  this  previously  mapped  area.  For 
both  simplicity  and  clarity,  assume  that  the  robot  doesn’t  have  any  shared  information  with  the  map 
as  shown,  (b)  Now  suppose  that  the  robot  re-localizes  itself  by  making  relative-pose  measurements 
xr0  and  xri.  (c)  Suppose  that  rather  than  augmenting  our  map  representation  to  include  the  robot 
pose,  xr,  we  instead  perform  time- prediction  (i.e.,  the  robot  state  evolves  from  xr  to  xr+i).  (d)  Now 
suppose  that  the  robot  makes  relative-pose  measurements  xr2  and  xr3.  (e)  Time  propagating  the 
robot  pose  xr+i  to  xr+2,  followed  by  marginalization  over  xr+i,  results  in  a  fully  dense  information 
matrix. 
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APPENDIX  A 


Robot  System,  Models,  and  Coordinate  Frames 


JHE  aim  of  this  thesis  is  to  develop  a  novel,  scalable,  SLAM  algorithm  that  respects  the 
constraints  of  low-overlap  imagery  typical  of  AUVs  while  exploiting  the  information 
_  associated  with  the  inertial  sensors  that  are  routinely  available  on  such  platforms. 
Developing  such  a  framework  requires  that  we  make  judicious  assumptions  regarding  the 
type  of  platform  capabilities  that  can  be  reasonably  expected,  the  set  of  conventions  for 
defining  coordinate  frame  relationships,  the  level  of  detail  required  in  modeling  vehicle 
dynamics,  and  the  type  of  information  measured  by  our  strap-down  sensors.  In  this  chapter 
we  describe  our  platform  assumptions,  coordinate  frame  conventions,  and  vehicle/sensor 
models. 

A.l  Platform 

The  assumed  platform  capabilities  are  thoroughly  grounded  in  modern  oceanographic  AUV 
technology.  Much  of  the  experimental  development  work  for  this  thesis  has  been  conducted 
using  the  SeaBED  AUV  [145-147]  —  a  bottom-following,  hover-capable,  imaging  research 
platform  (Fig.  A-l)  equipped  with  a  standard  suite  of  underwater  dead-reckoned  naviga¬ 
tion  sensors  (Table  A.l).  In  particular,  its  main  navigation  source  is  an  acoustic  Doppler 
velocity  log  which  measures  seafloor  referenced  velocities  with  a  precision  on  the  order  of 
1-2  mm/s.  These  velocities  can  then  be  integrated  over  time  to  provide  a  dead-reckoned 
position  estimate  for  real-time  control.  The  accuracy  of  the  position  estimate  is  governed 
by  absolute  orientation  measurements.  In  our  case,  these  are  minimally  instrumented  using 
a  magnetic-compass  for  heading  (good  to  a  few  degrees),  and  tilt  sensors  for  roll  and  pitch 
(±0.5°).  In  addition  to  the  DVL  estimate,  bounded  error  vehicle  depth  is  measured  via  a 
Paroscientific  pressure  sensor  good  to  0.01%  precision,  which  translates  to  a  few  centimeters 
over  full  ocean  depths. 

Since  SeaBED  is  intended  to  be  a  scientific  imaging  platform,  its  uses  a  two-hull  design 
to  give  good  separation  between  center  of  mass  and  center  of  buoyancy  making  the  vehicle 
passively  pitch  and  roll  stable.  For  optical  imaging,  SeaBED  is  equipped  with  a  calibrated, 
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Table  A.l  Specifications  of  the  SeaBED  AUV  platform. 


Vehicle 

Depth  rating 

Size 

Mass 

Cruising  Speed 
Batteries 

Propulsion 

2000  m 

2.0  m  (L)  x  1.5  m  (H)  x  1.5  m  (W)  (bbox) 
200  kg 

0.15-1.2  m/s 

2  kWh  Li-ion  pack 

(4)  150  W  brush-less  DC  thrusters 

Navigation 

Depth 

Velocity 

Tilt 

Heading 

Altitude 

Angular  rates 

0.01%  Paroscientific  pressure  sensor 

±1-2  mm/s  RDI  1200  kHz  Navigator  ADCP 
±0.5°  RDI  (internal) 

±2.0°  RDI  (internal) 

0.1  m  RDI  (beam  avg.) 

l°/s  Crossbow  AHRS 

Optical  Imaging 

Camera 

Lighting 

Separation 

12bit  1280x1024  Pixelfy  CCD  (bw  or  color) 
(1)  200  W  •  s  strobe 

1  m  between  camera  and  light 

Acoustic  Imaging 

Sidescan  sonar 
Pencil-beam  sonar 

300  kHz  MST  (300  m  depth  rating) 

675  kHz  Imagenex  881 

Other  Sensors 

CTD 

Seabird  37SBI 

Figure  A-l  The  SeaBED  AUV 


(b)  Vehicle  with  skins. 


Syntactic  Foam 


Electronics  Housing 


Camera 


\ 

Battery 


Strobe 


Sidescan 


DVL 


(a)  CAD  model. 
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down- looking,  high  dynamic  range  (12bit)  color  CCD  that  can  be  swapped  out  for  a  black 
and  white  camera  depending  on  the  scientific  application.  SeaBED  relies  upon  a  2  kWh 
Li-ion  battery  pack  for  power  and,  therefore,  uses  strobe  flash  photography  to  reduce  power 
consumption  and  increase  vehicle  endurance.  Typical  survey  speeds  are  usually  in  the  range 
of  20-60  cm/s,  though,  the  vehicle  is  capable  of  obtaining  speeds  up  to  1.2  m/s.  The  former 
range  of  speeds  provide  approximately  15-35%  temporal  image  overlap  at  altitudes  ranging 
from  2. 5-4.0  m. 


A. 2  6-DOF  Coordinate  Frame  Relationships 

This  section  describes  the  relevant  reference  frames  involved  in  vehicle  navigation  and 
their  6-DOF  coordinate  frame  relationships  as  illustrated  in  Fig.  A-2.  We  follow  standard 
SNAME1  convention  [48]  and  define  the  vehicle  frame,  denoted  subscript  v,  to  be  coincident 
with  a  fixed  point  on  the  vehicle  and  oriented  such  that  the  positive  X ,,-axis  is  aligned  with 
the  bow,  positive  Y„  starboard,  and  Z„  down  completing  a  right  handed  coordinate  frame. 

Additionally,  we  must  consider  each  onboard  sensor’s  own  internal  coordinate  frame  in 
which  measurements  are  expressed  and  its  relationship  to  the  vehicle.  The  sensor  frame, 
denoted  subscript  s,  is  assumed  to  be  static  and  known  with  respect  to  the  vehicle  frame  (i.e., 
calibrated  beforehand).  Finally,  two  navigation  frames  are  defined  and  used  for  expressing 
vehicle  pose.  The  first  is  the  world  frame,  denoted  subscript  w,  which  is  a  static  reference 
frame  located  at  the  water  surface  oriented  with  Xw-East,  Y^-North,  and  Z^-Up  and  is  useful 
for  displaying  results  since  it  follows  standard  map  convention.  The  second  navigation  frame 
that  we  define  is  the  local-level  frame,  denoted  subscript  i.  This  frame  is  coincident  with 
the  world  frame,  however,  it  is  oriented  with  Xf-North,  Y^-East,  z^-Down  and  corresponds 
to  a  zero-orientation  (i.e.,  local-level)  version  of  the  vehicle  frame;  this  frame  is  useful  for 
navigation  because  standard  compass-measurements  are  consistent  with  the  right-hand  rule 
convention  about  the  z-axis. 


A. 2.1  Pose  Description 

We  adopt  the  Smith,  Self,  and  Cheeseman  [154]  coordinate  frame  notation  and  define  the 
6-DOF  pose  of  frame  j  with  respect  to  frame  i  as 

X*7  =  [  ^ij  >®ijj  =  feij ,  Dij  1  %ij ,  4*ij  1  @ij ,  ^ij]  • 

Here,  *ty  is  a  Euclidean  3-vector  from  i  to  j  as  expressed  in  frame  i,  and  ©tJ  is  a  3-vector 
of  XYZ-convention  roll,  pitch,  heading  Euler  angles.2  For  this  Euler  definition  the  3x3 


*The  Society  of  Naval  Architects  and  Marine  Engineers. 

2Note  that  we  differ  from  Smith,  Self,  and  Cheeseman  in  our  roll,  pitch,  heading  (rph)  Euler  angle 
definition  and  instead  follow  Fossen’s  [48],  which  is  standard  convention  for  guidance  and  control  applications. 
Effectively,  our  RPH  are  swapped  for  their  HPR,  thus,  the  6-DOF  relationships  presented  in  this  section 
correspond  to  permutations  of  the  relations  given  in  the  appendix  of  [154] . 
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Figure  A-2  Illustration  of  the  relevant  navigation  frames.  The  blue  and  black  frames  represent 
the  world  and  local-level  frames  respectively  which  are  coincident.  The  cyan  frame  represents  the 
vehicle  reference  frame  while  the  magenta  frame  represents  an  arbitrary  sensor  frame.  The  sensor 
and  vehicle  frames  are  attached  to  the  same  rigid  body  and  therefore  are  static  with  respect  to  each 
other. 
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orthonormal  rotation  matrix  that  rotates  frame  j  into  frame  i  is  defined  as 
JR  =  rotxyz(0;j) 

—  rotz(t/fy)T  roty(0ij)T  rotx(^j)T  (principle  rotation  sequence) 
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cos  ip  cos  9  -  sin  ip  cos  (p  +  cos  ip  sin  9  sin  (p  sin  ip  sin  <p  +  cos  ip  sin  9  cos  <p 
=  sin  ip  cos  9  cos  ip  cos  <p  +  sin  ip  sin  9  sin  (p  —  cos  ip  sin  (p  +  sin  ip  sin  6  cos  (p  . 

—  sin  6  cos  6  sin  <p  cos  9  cos  (p 

Similarly,  given  the  rotation  matrix  JR  we  can  recover  the  Euler  angles  as3 

<f>ij  =  atan2  (*  Rj  3  sin  - ‘R^cos  -‘Ria  sin  +  jR,  2  cos  ^ ) 

Oij  =  atan2(-‘R3 4 ,  JRt ,  cos tpij  +  sin  V'tj) 

=  atan2(‘R21,  *RU) 

where  the  notation  j-Rm  n  means  element  (m,n)  of  ’R.  Note  that  from  the  6- vector  pose 
description  xy  ,  we  can  conveniently  obtain  the  standard  4x4  homogeneous  coordinate 


3Note  that  this  representation  of  orientation  suffers  from  a  singularity  at  6  =  ±  ^ ,  but  that  in  practice 
our  vehicle  application  never  operates  anywhere  near  this  singular  configuration  (i.e.,  pitch  ±90°). 
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transformation  matrix  from  frame  j  to  frame  i  as 

)R  V 

°  1  * 

We  can  use  this  6-vector  representation  to  define  the  mechanics  of  some  fundamental 
coordinate  frame  operations  used  as  building-blocks  in  articulating  more  complex  coordinate 
frame  relationships.  These  operations  are  particularly  useful  in  §A.4  where  we  define  our 
strap-down  sensor  observation  models. 


Figure  A- 3  General  coordinate  frame  relations  for  three  arbitrary  frames  z,  j,  and  k. 


A. 2. 2  Head-to-Tail  Operation 

The  head-to-tail  relationship  is  a  fundamental  operation  and  is  used  to  describe  coordinate 
frame  composition.  Given  pose  vectors  Xij  and  Xjfc,  the  head-to-tail  operation  yields  frame  k 
with  respect  to  frame  z  (i.e.,  x^)  as  illustrated  in  Fig.  A-3.  Using  standard  homogeneous 
coordinate  transforms  we  can  derive  this  composition  as 

1h  =  ;h;h 

from  which  the  resulting  transform  J.H  can  be  decomposed  into  *.R  and  lt &  to  obtain  the 
pose  vector  x,^ .  Similarly,  we  can  sidestep  this  intermediate  decomposition  step  and  define 
this  head-to-tail  coordinate  frame  composition  directly  in  component  form  as 


Xifc  =  Xij  ®  xjk 

=  [^iki  Viki  ziki  4>iki@ikt1Pik\ 
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where  jj.R  =  JR  ^.R. 

The  Jacobian  is  a  useful  quantity  that  allows  us  to  compute  a  first-order  covariance  esti¬ 
mate  of  Xj k  in  the  event  x,^  and  Xjk  are  random  variables.4  For  the  head-to-tail  relationship 
this  is  given  by 


_  dxjk 

d(xij,Xjk ) 

—  [J©1  J©2] 


13x3  M 

5R 

03x3 

03x3  Kl 

03x3 

K2 

where  J©i  and  J©2  correspond  to  the  left  and  right  half  partitioning  of  J©  (i.e.,  partials 
with  respect  to  xtJ  and  Xjk  respectively)  and 


M 

Kj 


K2 


"jR-1,3 Vjk  ~  {zik  ~  Zij )  cos  ipij  ~(yik  -  yij ) 

jRv.sVjk  -  JR-2,2 Zjk  ( Zik  -  Zij)  sin  i>ij  (xik  -  Xij) 

—  *jR-3,2zjk  —Xjkcosdij  (yjk  sin  (pij  +  Zjk  cos  (pij )  sin  9{j  0 

cos  9{j  cos(tpik  -  iptj)  sec  9ik  s'm(rpik  -  rpij)  sec  9ik  0" 

-  cos  9ij  sin(tpik  -  xpij )  cos (rpik  -  rpij)  0 

J.R, ,2  sin  4>ik  +  ^R,  3  cos  <plk  sec  9ik  sin(i/',fc  -  rpij)  tan  9ik  1_ 

T  sin (<pik  -  <pjk)  tan  9ik  (JRj  3  cos rpik  +  * R^  3  sin  rpik)  sec  9ik 
0  cos {<pik  -  (pjk)  -  cos  9jk  sin (<pik  -  <pjk) 

0  sin (<pik  -  (pjk)  sec  9ik  cos  9jk  cos (<pik  -  <pjk)  sec  9ik  _ 


Example: 

Given  vehicle  pose  in  the  local-level  frame,  x^,,  we  can  compute  the  corresponding  sensor 
pose  in  the  local-level  frame  as 

X*s  =  Xiv  ©  Xvs 

where  x„*  is  the  static  sensor-to-vehicle  pose.  Assuming  that  vehicle  pose  x^  is  a  random 
variable  with  covariance  £x,v,  then  to  first-order  the  covariance  of  the  sensor  pose  in  the 
local-level  frame  is 

SX£a  =  J©lSX£vJ0l 

since  xvs  is  static  and  assumed  known  (i.e.,  xvs  is  not  a  random  variable). 

Example  A.l  Obtaining  sensor  pose  and  its  uncertainty  in  the  local- level  frame  via  compounding. 


A. 2. 3  Inverse  Operation 

The  inverse  relationship  is  another  fundamental  operation  and  is  used  for  reversing  a  coor¬ 
dinate  frame  relationship.  Given  pose  vector  Xij,  the  inverse  operation  yields  frame  i  with 

4 If  x  is  a  random  variable  with  mean  f. ix  and  covariance  £xx,  then  to  first-order  the  random  variable 
y  =  f(x)  has  mean  =  f(/ix)  an(l  covariance  £yy  =  JEXXJT  where  J  =  §£|X=M  [154]. 
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respect  to  frame  j,  (i.e.,  Xji).  Using  standard  homogeneous  coordinate  transform  notation 
we  can  derive  this  operation  as 

;h  =  ;h-' 

from  which  the  resulting  transform  ]H  can  be  decomposed  into  and  Hji  to  obtain  the 
pose  vector  x^.  Again,  we  can  define  this  coordinate  frame  operation  directly  in  component 
form  as 


\7i 


=  OX; 


IJ 


—  \xji  yVjiy  zji y  (frji y  @ji  y  V* ji ] 


-sRI 


Xij 
Vij 
Lz*j  J 


atan2 (' R31  sin ipj,  —  'R32  cos ipji,  — JR^  sinipji  +  'R2  2  cos rpji ) 
atan2(-J-Rli3,  JR,.,  cos ipji  +  JR,  2  sin  ipji) 

_  atan2(JR,2,  jR,,j 


with  Jacobian 


dxjt  _ 

Je  —  tt —  — 

OXij 


-}R 

03x3 


N 

Q. 


where 


N  = 


Q  = 


0  -‘R3  i  ( Xij  cos i’i]  +  Vij  sin  ipij)  +  z ij  cos 6ij 


jRjj  Xij  j?Ri,i  2/tj 


•‘j* 
_  Vji 

1 


(1-j^.a) 


-}R3,2  {xij  cos  tpij  +  yij  sin  V'ij)  +  <z»j  sin  0^  sin  faj  JR^Xj,  -  JR 1>2yij 
— *  R3  a  (xj j  cos  ipij  +  yij  sin  ipij )  +  Zij  sin  0ij  cos  <pij  jR2  3Xtj  JRi,jJ/tj. 

j^'1,1  j R 1 , 2  cos  <plj  j-R13jR3  3 


-JR-2,3  COS  Ipij 


_t  ft 
j1X  3,3 


Example: 

Given  the  vehicle  pose  in  the  local-level  frame,  x/w,  we  can  express  the  local-level  frame 
from  the  vehicle’s  point  of  view  as 

xvi  =  Gxev. 

Assuming  that  vehicle  pose  X£V  is  a  random  variable  with  covariance  £x*v ,  then  to  first-order 
the  covariance  of  the  local-level  frame  with  respect  to  the  vehicle  is 

=  Je£x,vJ©. 


Example  A. 2  Using  the  inverse  operation  to  express  local- level  with  respect  to  the  vehicle  frame. 
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A. 2.4  Tail-to-Tail  Operation 


Finally,  we  come  to  the  tail-to-tail  operation  which  is  a  composite  relationship  built  upon 
the  head-to-tail  and  inverse  operation.  This  composite  operation  occurs  frequently  and 
therefore  is  worthwhile  to  define  on  its  own.  The  tail-to-tail  operation  is  used  to  express  the 
relative-pose  between  two  frames  that  are  represented  in  a  common  coordinate  system.  For 
example,  given  pose  vectors  xtJ  and  xt*.,  the  tail-to-tail  operation  yields  the  relative-pose 
Xjfc.  Using  standard  homogeneous  coordinate  frame  notation  we  can  derive  this  composite 
operation  as 

Similarly,  using  the  head-to-tail  and  inverse  relationship  we  can  equivalently  define  Xjk 
directly  in  component  form  as 


Xjifc  =  Xji  ©  Xik 

=  Gxij  0  xik. 

The  associated  Jacobian  of  this  composite  operation  can  be  obtained  by  chain-rule  as 

T  _  dxjk 

©*^©  o/  \ 

dfeij  i  xt/c) 

dxjk  d{xji,Xik) 


9{xji,Xik)  d^XijyXik) 

J©  06x6 

06x6  16x6. 

=  [J©lJ©  J©2]- 


=  J 


Example: 

Given  local-level  vehicle  poses  X£Vi  and  x/v.  corresponding  to  times  tx  and  tj  respectively, 
we  can  express  their  relative  pose  xViVj  as 


XviVj  —  ©  X ivj  • 


Assuming  the  two  poses  X£Vi  and  X£Vj  are  random  variables  with  covariances  Ex^  and 

and  cross-covariance  Ex,v  X£v  ,  then  to  first-order  the  covariance  of  their  relative  relationship 

is 


—  ftJfi 


jxu.v.  —  ©J® 


xtviXevj 


JXtviXlvj 

Sx/..  . 


© 


J 


T 

©• 


Example  A.3  Using  the  tail-to-tail  operation  to  compute  the  relative-pose  between  two  time- 
sampled  vehicle  poses. 
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A. 3  Vehicle  Model 


A  vehicle  model  is  a  useful  mathematical  concept  used  to  describe  the  change  in  vehicle 
state  in  response  to  control  inputs  (or  lack  thereof)  and  is  often  a  required  component 
in  a  general  sensor  fusion  framework  that  tries  to  estimate  the  vehicle’s  state  based  upon 
noisy  sensor  measurements;  see  any  of  [7, 18,  54]  for  reference.  We  choose  to  model  the 
vehicle  dynamics  using  a  6-DOF  constant  velocity  process  model.  This  model  sufficiently 
captures  the  characteristically  slow-dynamics  of  most  underwater  imaging  platforms  used 
in  a  structured  survey  context  and,  additionally,  can  be  applied  across  multiple  vehicle 
platforms  without  reformulation.5 


A. 3.1  6-DOF  Continuous-Time  Constant  Velocity  Process  Model 

Deterministic  Description 

The  vehicle  state  xv  is  defined  by  the  12-vector: 

=  [etJv,  &JV,  VT,  vu>t]T  (A.l) 

where  X£V  =  [*t*t,T,  ©J,]"1"  is  the  local-level  vehicle  pose  as  defined  in  §A.2.1,  vu  =  [u,v}w] 
are  the  body-frame  linear  velocities,  and  vu)  =  [p,  q,  r]  are  the  body-frame  angular  rates. 
Under  the  constant  velocity  approximation,  this  state  description  allows  us  to  define  the 
deterministic  component  of  the  continuous-time  motion-model  as 


xv(t)  =  f{xv(t)) 


'et(v' 

■JRV 

d 

®(v 

Jv  u 

dt 

V 

03x1 

mVU/_ 

—1 

X 

CO 

o 

(A. 2a) 


(A. 2b) 


where  *R  is  the  orthonormal  rotation  matrix  rotating  the  body-frame  velocities  into  the 
local-level  frame  and  J  is  a  3  x  3  matrix  mapping  body-frame  rates  to  Euler  rates.  Note 
that  both  and  J  have  a  nonlinear  dependence  on  ©*v.  The  mapping  J  can  be  derived 
from  first-principles  by  considering  the  inverse  relationship  whereby  the  principle  rotation 
sequence  rotz(V>)  — ►  roty(0)  — ►  rotx(0)  (see  §A.2.1)  is  used  to  map  Euler  rates  to  body  rates 
as 


p 

9 

— 

o 

1 _ 

+  rotx(</>) 

"O' 

9 

+  rotx(<£)  roty(0) 

'O' 

0 

r 

o_ 

0_ 

A 

t— ‘ 

o 

1 

►2 

5’ 

_ i 

0  cos  sin  cos  9 

9 

0  —  sin  0  cos  0  cos  9 

A 

J-1 


5The  process  noise  autocorrelation  matrix  Q  in  (A. 3)  may  need  to  be  appropriately  tuned. 
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Thus,  the  mapping  from  body-frame  rates  to  Euler  rates  is  given  by 


'1 

0 

—  sin0 

-1 

'1 

sin  <\>  tan  0 

cos  cf>  tan  0 

J  = 

0 

COS  </> 

sin  cp  cos  6 

= 

0 

COS  (j) 

—  SlTUf) 

0 

—  sin0 

cos  (p  cos  6  _ 

0 

sin  </>  sec  0 

cos  0  sec  6 

In  §A.3.2  the  process-model  Jacobian  will  be  required,  so  for  completeness  we  present 
it  here.  To  derive  the  Jacobian  we  first  define  the  following  quantities: 


'i 

0 

0 

cos  0  0- 

sin  0 

cos  ip 

sin  ip  0 

R<p  = 

0 

cos  <p 

sirup 

R*  = 

0  1 

0 

— 

—  sin  ip 

cos  ip  0 

0 

-  sirup 

cos  <p 

sin0  0 

cos  0 

0 

0  1 

'0 

0 

0 

—  sin  0  0 

—  cos  0 

—  sirup 

cos  ip 

O' 

R<p  = 

0 

—  sirup 

COS  (p 

TU  = 

0  0 

0 

R/0  = 

—  cos  ip 

—  sirup 

0 

0 

-  cos  <p 

—  sirup 

cos0  0 

—  sin0 

0 

0 

0 

'0 

cos  (j)  tan  6 

—  sin  (j>  tan  0 

'0 

sin  (j)  sec2  0 

cos  <\>  sec2  6 

0 

—  sin0 

—  COS  (f> 

Je  = 

0 

0 

0 

0 

cos  (\>  sec  6 

—  sin  (f)  sec  6 

0 

sin  (j)  sec  0  tan  0 

cos  (j)  sec  0  tan  6 

Using  the  above  definitions  the  nonzero  partials  of  the  process- model  Jacobian  are: 

=  'R 

=  J 

Hence,  the  total  process  model  Jacobian  is  given  by 


F-  0x» 

03x3 

O&tv 

cFV 

03x3 

03x3 

0&lv 

03x3 

dvuj 

06x3 

06x3 

06x3 

06x3 

dei 


tv 


d<p 

d&ev 

d(p 


=  r}rJrJv 


=  j<pVU> 


)TnTDT«, 


^^tv  _  n  I  n  In  I  V 
QQ  ^ <t>  ^ 

_  '7 

~ar~Je 


_  6  T  r,  T  r,  T  u , 


dip 

d&ev 

dip 


=  r^r;v 


—  03x1 


d&fy 

dvu> 


Probabilistic  Description 


Clearly,  the  deterministic  motion-model  of  (A. 2)  represents  an  approximation  to  the  true 
vehicle  dynamics.  As  is  typical  engineering  practice,  we  account  for  this  modeling  error  by 
augmenting  our  motion-model  to  include  a  probabilistic  term  which  reflects  our  negligence 
of  model  detail: 


x„(f)  =  f(xv(f))  +  Gw  (t). 


(A.3) 


Here,  G  — 


06x6 

16X6 


is  a  gain  matrix  mapping  the  6-vector  zero-mean  white  Gaussian  noise 


process  w(£)  with  covariance  £?[w(£)w(t)t]  =  Q S(t  —  r)  to  the  rate  derivatives  vis  and  vu 
This  white  process  noise  reflects  the  fact  that  the  rates  Vv  and  vu  of  (A. 2)  are  not  truly 
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constant,  but  instead  are  approximated  by  a  random  walk  variation  in  time. 

A. 3. 2  Continuous-Time  to  Discrete-Time  Mapping 

While  the  continuous-time  formulation  of  (A. 3)  is  the  most  natural  representation  for  ana¬ 
lytically  expressing  the  vehicle  dynamics,  it  must  be  translated  from  its  native  formulation 
into  a  discrete-time  representation  if  we  are  to  implement  it  within  a  computer  algorithm. 
One  straightforward  technique  (which  conveniently  requires  zero-reformulation)  is  to  use 
a  fourth-order  Runge-Kutta  numerical  integration  strategy  [119]  to  propagate  the  vehicle 
state  forward  in  between  discrete  samples  in  time  (e.g.,  this  technique  can  be  used  within 
a  continuous-discrete  Kalman  filter  formulation  [54]  to  advance  the  vehicle  state  between 
asynchronous  sensor  measurements  as  described  in  Chapter  2).  However,  for  the  informa¬ 
tion  framework  presented  later  in  Chapter  4  it  will  prove  useful  to  define  the  vehicle  model 
directly  as  a  discrete-time  difference  equation. 


Piecewise-Constant  Discrete-Time  Difference  Equation 

Suppose  that  at  time  tk  we  have  an  estimate  for  xv(tk)  denoted  /itfc  —  expanding  (A. 3)  in 
a  Taylor  series  about  this  point,  we  get 


x„(0  =  f intk)  +  Fx  (x„(0  -  fitk)  +  HOT  +  Gw((). 


Here,  F„  = 


is  the  process  model  Jacobian  and  HOT  denotes  higher-order  terms 
in  the  expansion.  Dropping  the  higher-order  terms  and  rearranging  we  have 


xv(<)  «  Fxxv(f)  +  (f(/itfc)  -  Fx/i(J  +Gw(f) 

' - V - ' 

=  Fxxv(<)  +  u  (tk)  +  Gw(t) 

where  (f(/xtfc)  —  Fx/i*fc)  masquerades  as  a  constant  input  pseudo  control  u (tk). 

Assuming  that  u(£*)  and  Fx  remain  constant  over  a  short  time  step  At  =  [tk,tk+ 1)>  we 
can  piecewise  approximate  our  nonlinear  partial  differential  equation  (PDE)  (A. 3)  over  the 
time  interval  At  with  the  constant  coefficient  PDE  specified  in  (A. 4) 


xv(t)  =  F  Xxv(t)  +  Bu  (t)  +  Gw  (t)  (A. 4) 

where  B  is  simply  the  identity  matrix  and  u (t)  =  u (tk)  for  tk  <  t  <  tk+\.  Now  that  we  have 
a  linear  constant  coefficient  PDE,  we  can  map  this  directly  to  a  discrete-time  difference 
equation  sampled  at  arbitrary  times  [7]: 

xv[4+i]  =  F  kxv[tk]  +  Bku[tk\  +  w  [tk\.  (A. 5) 

To  evaluate  the  parameters  of  (A. 5)  we  note  the  following.  For  the  constant  coefficient  PDE 
of  (A. 4)  the  corresponding  discrete-time  state  transition  matrix  Fk  is  given  by 
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The  discrete-time  control  gain  B*  is  computed  as6 


r*fc+i 


B  k  —  [  eFx^fc+1  T^B  dr  =  f  eFx(*fc+i  T)fo  =  eFx4+i 

Jtk  Jtk 


ftk+l  i? 

/  e~FxTdr. 

Jtk 


And  finally,  the  zero-mean  discrete-time  white  Gaussian  process  noise  w  [tk]  is  related  to 
the  zero-mean  continuous-time  white  Gaussian  noise  w(t)  by 

w  [tk]  —  f  e(lfc+1"r)Gw  (r)dr 

Jtk 

with  covariance 

Q*:  =  E  w[4]w[4]Tj  =  £  eFx(<fc+i“T)QQ(r)GTeFx  (tfc+i“T)rfr. 


For  the  fixed  parameter  case  of  (A. 4)  the  evaluation  of  Q*  simplifies  considerably  and 
can  be  computed  using  Van  Loan’s  method  [18,  §5.3].  In  summary,  Algorithm  A.l  describes 
the  continuous  to  discrete  mapping  as  we  implement  it. 


A. 4  Sensor  Observation  Models 

Using  our  description  of  vehicle  state  defined  in  (A.l),  in  this  section  we  describe  our  models 
for  strap-down  sensor  measured  quantities.7  The  main  idea  behind  an  observation  model  is 
to  use  our  knowledge  of  vehicle  state  to  try  and  “predict”  sensor  observed  quantities  in  the 
sensor  frame  of  reference  —  the  reason  for  choosing  the  sensor  frame  of  reference  is  because 
this  is  where  the  measurement  “noise”  is  most  naturally  expressed.  Within  a  typical  state 
estimator  framework,  the  discrepancy  between  predicted  and  measured  quantities  is  used 
in  a  weighted  update  to  modify  our  belief  in  what  the  sensor  is  telling  us  versus  what  our 
kinematics  (i.e.,  process- model)  tell  us. 

Note  that  in  deriving  the  following  observation  models  we  assume  that  the  sensor  to 
vehicle  relative-pose  quantity  xvs  =  [vtvs,  @vs]  is  known  from  calibration  and  that  the 
vehicle  state  xv  is  defined  by  (A.l). 


A. 4.1  DVL  Observation  Model 

When  operating  in  bottom-lock  mode,  the  DVL  measures  seafloor-relative  velocities  as  ex¬ 
pressed  in  its  sensor  reference  frame  (refer  to  §1.3.2  for  a  more  in-depth  discussion).  Ac¬ 
counting  for  the  moment-arm  between  the  body-fixed  vehicle  frame  origin  and  the  sensor 


6Note  that  a  closed-form  solution  does  not  exist  because  the  Jacobian  Fx  is  singular,  therefore,  we 
numerically  evaluate  the  definite  integral  using  a  sufficient  number  of  intervals  and  Simpson’s  Rule  [119]. 

7 Camera  derived  quantities  are  discussed  in  Chapter  2  where  we  introduce  the  concept  of  visually  aug¬ 
mented  navigation. 
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Algorithm  A.l  Piecewise-constant  continuous-time  to  discrete-time  mapping. 

1:  for  time  step  t*  do 

2:  Evaluate  the  continuous-time  process-model  (A. 3)  about  the  linearization  point  fxlk 

to  get  f {ntk)  and  the  Jacobian  Fx. 


3:  Form  the  block-matrix  X. 


— Fx 

GQGT 

0 

ft 

1  X 

4:  Evaluate  the  matrix  exponential  and  call  it  Y. 


Y  =  eXAt  = 


0 

TI 

i _ 

5:  Ft  <—  (lower- right  block  of  Y)T 


6:  Qit  <—  Ffc  x  (upper-right  block  of  Y) 

7:  u[t*]  «-  f(ftth)  -  F xntk 

g  B  <—  eF«it+i  ftk+1  e~FxTdr  numer'ca^y  evaluate  over  a  sufficient  number 
"*tk  of  intervals  using  Simpson’s  Rule 

9:  Evaluate  the  discrete-time  model  (A. 5)  for  the  current  set  of  parameters. 

10:  end  for 
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frame  we  have  [70] 


=  ;r(v-[x,]x”«) 

where  the  [.]  x  operator  represents  a  skew-symmetric  matrix  implementing  the  vector  cross- 
product.8  Therefore,  the  linear  state  observation  model  is  given  by9 


z  =  Hxv 

=  [03X3  03x3  SVR  -*R[Vtvs]x]  xv. 

A. 4. 2  Angular  Rate  Sensor  Observation  Models 

Since  the  sensor  and  vehicle  can  be  considered  as  a  rigid-body,  the  sensor  measured  angular 
rates  are  simply  the  vehicle  angular  rates  rotated  into  the  sensor  coordinate  frame: 

^sensor  =  ^ ' 

Therefore,  the  linear  state  observation  model  is  given  by 


z  =  Hxi; 

=  [03x3  03x3  03x3  £R]  *v 


A. 4. 3  Attitude  Sensor  Observation  Model 

For  the  DVL  and  angular  rate  sensor,  the  rate  measurements  they  provide  are  described  in 
a  sensor  relative  coordinate-frame  that  does  not  depend  upon  the  definition  of  an  external 
reference  frame.  However,  in  the  case  of  absolute  orientation  measurements  as  measured  by 
an  attitude  module  consisting  of  a  compass  and  tilt  sensors,  its  definition  of  orientation  is 
with  respect  to  a  particular  reference  frame.  Therefore,  when  deriving  the  attitude  sensor 
observation  model  we  must  consider  that  the  sensor’s  external  reference  frame,  denoted 
subscript  r,  may  not  coincide  with  the  local-level  definition  used  by  the  vehicle.  Hence,  the 
static  pose  of  local-level  with  respect  to  the  sensor  reference  frame  (i.e.,  xr*)  will  have  to 
be  considered  in  the  general  case. 

The  predicted  sensor  pose  is  given  by 


xrs  =  xT(  ©  (x£l)  ©  xvs) 

8The  skew-symmetric  matrix  S  for  the  3- vector  s  =  [si,  S2, 53]T  is  given  by 


S  =  [s]  x 


0 

-53 

52 

53 

0 

-S\ 

“52 

5l 

0 

9We  model  the  DVL  as  providing  a  3- vector  measurement  of  Euclidean  velocities  as  measured  in  the  sensor 
frame  when  in  actuality  it  measures  velocity  components  along  each  of  its  4  acoustic  beams.  Unfortunately, 
our  data  logging  strategy  only  records  the  resolved  3- vector  sensor  velocities  which  prevents  us  from  modeling 
beam-level  detail.  However,  note  that  to  model  the  actual  beam  velocity  measurements  the  only  modification 
that  is  required  is  to  premultiply  H  with  the  static  beam-geometry  transformation  matrix  T  [131]. 


158 


with  relevant  partial  derivative 


dXrs 

dx(v 


dx.rs 

dxts 


(xr<,  x<a) 


•  J©1 


(x*v,  XVa) 


However,  we  are  only  interested  in  the  attitude  portion  which  can  be  extracted  as 

® sensor  =  [03x3  ^3x3]  Xrs. 

' - v - ' 

A 

Therefore,  the  nonlinear  observation  model  is  given  by 


z  =  h(xv) 

=  A(xrf  ©  (xev  ®xvs)) 


with  Jacobian 


Hx  =  A(J02 


(xr*>  X,a) 


‘  J©1 


(x€v,  Xv 


J- 


A. 4. 4  Depth  Sensor  Observation  Model 

For  the  depth  sensor,  the  predicted  sensor  pose  is  given  by 


X*S  =  X£V  ©  XVS 


of  which  we  are  only  interested  in  the  Z-component  which  can  be  extracted  as 

^sensor  —  [0>  0*  0,  0]  X£s. 

S - V - ' 

A 

Therefore,  the  scalar  nonlinear  observation  model  is  given  by 


with  Jacobian 


z  =  h(xv) 

=  A(xev@xvs) 


Hx  —  AJ0i. 
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APPENDIX  B _ , 

i  i 

I _ Accompanying  Derivations  for  SEIFs 


IN  this  chapter  we  derive  from  first  principles  some  of  the  expressions  of  the  Gaussian 
information  form  that  were  given  in  Chapter  3.  In  particular,  we  derive  the  expressions 
.  for  marginalization  and  conditioning  in  the  information  form,  as  well  as  the  expressions 
for  the  both  the  original  and  modified  SEIF  sparsification  rules. 


B.l  Mechanics  of  the  Information  Form 

Suppose  the  normal  random  variable  £  is  written  in  partitioned  form  as  £  =  [aT,/3T]T 
where 


p{£)=Af{p,T)=Af  1  (r/,A) 


=»  p(a,/3)  =  Af( 


Aa 

^qo  ^a0 

Va 

Aqq  Aq^3 

.^0. 

> 

S/to  ^00. 

)  —  j\  V 

M0. 

1 

Apa 

) 


and  by  definition  A  =  E  1  and  77  =  A fi.  In  the  following  we  derive  the  fundamental  prob¬ 
abilistic  operations  of  marginalization  and  conditioning  in  the  information  form.1 


lrThe  marginalization  and  conditioning  expressions  we  derive  hold  in  the  general  case  for  any  partitioning 
of  £  since  we  can  always  reorder  the  elements  into  the  form  £  =  [c*T,/3T]T  via  an  appropriate  orthonormal 
permutation  matrix  A.  For  example,  suppose  £  =  [aT,bT,cT]T  with  p(()  =  E)  =  (ri,A)  and 

we  want  =  [aT,cT,bT]T,  then  ('  =  A£  where 


A  = 


la  X  a 

Ocxa 
Ofe  X  a 


0ax6  Oaxc 
0cx6  IcXc 

Ifcxb  Obxc 


and  since  is  a  linear  transformation  of  £,  it  remains  a  normal  random  variable  with  statistics 

pi  =  A  pi  rj'  =  A '  \jl  =  AAAT  A/i,  =  A  77 

Y!  =  AEAt  an  A'  =  (E'r1  =  A-tE"1A‘1  =  AAAT 
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B .  1 . 1  Marginalization 

Suppose  we  want  the  distribution  over  a  only,  then  to  obtain  p(a)  we  must  marginalize 
out  /3’s  cumulative  effect  via  integration: 

a,(3)d(3  =  J  V(/x,E)d/3  = 


1  (rj,  A)d(3. 


Covariance  Form 


In  covariance  form,  rather  than  actually  computing  the  indefinite  integral  over  /3,  we  can 
obtain  the  same  result  by  considering  the  following  linear  transformation: 

z  =  a  =  A£  where  A  =  [lQXa  0QX/3], 

which  implies  that  the  mean  and  covariance  of  the  normal  random  variable  z  are  given  by 


nz  =  An  =  na 


E«  =  AEAt 


=  E„ 


(B.l) 


Hence,  marginalization  is  a  constant-time  operation  in  covariance  form,  because  we  simply 
extract  the  appropriate  sub-elements/sub-block  from  the  mean- vector /covariance-matrix 
respectively. 


Inversion  of  a  Partitioned  Matrix 

Before  we  derive  the  equivalent  operation  in  the  information  form,  a  useful  result  from 
linear  algebra  that  we’ll  employ  is  the  inversion  of  a  nonsingular  block  2x2  matrix  which 
we  repeat  here  for  convenience.  Quoting  Bar-Shalom  [7,  §1.3.3]: 


The  inverse  of  the  (nonsingular)  n  x  n  partitioned  matrix 


Pll 

Pl2' 

-1 

'V11 

V12' 

P21 

p22 

V21 

V22. 

where  Pn  is  n\  x  ni,  P12  is  n\  x  712,  P21  is  n<i  x  ni,  P22  is  n 2  x  n 2  and  n\  +  n<i  =  n, 


has  the  partitions 

V11  =  Pn1  +  Pri1Pi2V22P2iPu1  =  (Pll  -  Pl2P22P2l)-1  (B.2) 

V 12  =  -Pn1Pi2V22  =  -VnPi2P2"2  (B.3) 

V21  =  -VssPjfiPri1  =  -P2-2P2iVn  (B.4) 

v22  =  P2"2  +  P22P2iVnPi2P22  =  (P22  -  P^P^Pn)-1  (B.5) 
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Information  Form 

To  obtain  the  expression  for  marginalization  in  the  information  form  we  simply  transform 
the  covariance  form  result  of  (B.l)  according  to 


(B-2)  a 

=  A, 


aa  AapAppApa 


Vz  =  A  zzHz 

=  (AaQ  —  Aa@App  A^a)/i.Q 
=  AQa/iQ  —  AapAppApafiQ 

=  (Va  -  AP«Vp)  ~  AapApp(r)p  -  Appfip) 

=  r/Q  —  Apafip  —  AapAppTfp  -f  AapAppAppfip 

=  T]a  —  AapAppT)p. 


Hence,  while  marginalization  is  a  constant-time  operation  in  covariance  form,  it  is  minimally 
a  cubic  operation  in  the  size  of  (3  in  the  canonical  form  due  to  the  inversion  of  App  in  (B.6). 


B.l. 2  Conditioning 

Suppose  we  want  the  distribution  over  a  conditioned  on  the  random  variable  /3,  then  to 
obtain  p(a|/3)  we  must  compute 

(  ifl'i  _ 

V  a  p(/3)  fN(n,Y,)da.  f  J\f~1(rj,A)da 


Covariance  Form 

In  covariance  form,  the  expressions  for  the  conditional  mean  and  covariance  are  well-known 
and  are  given  by  [7] 

A*a| p  =  Va  +  ^ap^ppifi  ~~  Pp)  .  * 

^ a\P  =  ^aa  —  ^ap^pp^pa- 

Hence,  while  marginalization  is  a  constant-time  operation  in  covariance  form,  conditioning 
turns  out  to  be  minimally  a  cubic  operation  in  the  size  of  /3  due  to  the  inversion  of  E pp  in 
(B.7). 


Information  Form 

To  obtain  the  expressions  for  conditioning  in  the  information  form  we  simply  transform  the 
covariance  form  result  of  (B.7)  according  to 

Aq|/3  =  ^Q\P  P  =  ^a\P^a\P 

=  a,)"1  =A„„(/1  „  + 

(B=?>  A00  '  =  ’  Am  -  A;X„(0  -  „„))  (B.8) 

=  (A QQ/xa  +  Aa0/j.p)  -  A apP 

=  VQ-  AapP- 


163 


Hence,  conditioning  is  a  constant-time  operation  in  the  canonical  form,  because  we  simply 
extract  the  appropriate  sub-elements/sub-block  from  the  informat ion- vector /information- 
matrix  respectively. 


B.2  SEIF  Sparsification  Rule 


We  begin  by  writing  the  sparsified  posterior  approximation  as  a  ratio  of  the  three  individual 
distributions  pB,  pc,  Pd  as  described  in  §3.3.1: 


PSEiF(xt,m°,m+,m  )  = 


Pb(x(, m+|m  =q) 
pe(m+|m~  =  «) 


P/)(m°, m+, m  ). 


(B.9) 


Next,  we  calculate  the  individual  terms  of  (B.9)  by  marginalizing  and  conditioning  over  our 
original  distribution  p(xf,  m°,  m+,  m-)  =  N{£t\nt,  Et)  =  A/"-1(£f;T7t,  A(). 

B.2.1  Calculation  of  p,i(xt,  m°,  m+|m_  =  a) 

This  intermediate  distribution  will  be  used  to  compute  both  pb  and  pc  and  is  obtained 
from  our  full  posterior  p(xt,  m°,  m+,  m_)  by  conditioning  on  the  passive  features  with  a 
realization  of  alpha  (i.e.,  in  =  a): 


Pa (x(,m°,m+|m  =  a)  =  A^(Sjt  Tno  m+^;  nA,  £4)  =  jsf  1(SjtiWEo ,m+€t;»7yi.  A>»)- 


Covariance  Form 


Pa  —  Sx,m°,m+Pt  +  EtSm-  (S^-E*Sm-)  (a  - 

(<Pt  Pot) 

—  Sjjmo?m+£*Sx>mO  m+  —  Sljnojn+T,tSrn-  (S^_EfSm-)  S^_E*Sx>mojm+ 


(B.10) 


Information  Form 


T  T  ' - A - s 

^7 A  _  ^xt,m° ,m+  AtSm-Qc 

m+  fat  ~  *lot) 

A A  —  AtSXtfmo>m+ 


(B.ll) 


B.2. 2  Calculation  of  p£(xt,m+|m  =  a) 

This  distribution  is  obtained  by  marginalizing  out  deactive  features,  m°,  from 

PB(xt,m+|irT  =  a)  =  J  p^(xf,m°,m+|m_  =  a)  dm0 

=  -V(S HB,  Eb)  =  AT1  (Sjtim+&;  r,B,  AB). 
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In  the  following,  P  denotes  a  projection  matrix  like  S,  but  only  for  a  subspace  of  £t. 


Covariance  Form 


=  S*,m+  (l*t  +  Va) 


SB  =  Plm+^P 


xt  ,m+ 

sT 


xt,m+ 


=  Sjim+EtSx,m+  -  Sjm+  EtSm- 


(B.12) 


Information  Form 

VB  =  Pjt,m+T?^  ~  Pjt,m+^'4Bmo(P^0A>lPm0)  P^o*?/! 

=  Sjt,m+  (f|t  -  Va)  ~  Sjt,m+AtSmo(S^oAtSmo)_1S^o  (fit  -  fla) 

AB  =  Pjt,m+A'4pit,m+  -  Pjt,m+A>'Pmo(PmoA>»Pm‘>)  ’pIoA^P^^ 
=  ®itlm+  AtSItitn+  —  Sj(Tn+ AjSmo  (S^oAtSmo)  S^oAtSItm+ 


(B.13) 


B.2.3  Calculation  of  pc  (m+ 1 m  =  a) 

This  distribution  is  obtained  by  marginalizing  out  both  the  deactive  features,  m°,  and  the 
robot  state,  xt,  from  p a  • 


Pc(m+  |m 


=  a)  =  JJpa  (xt,  m°,  m+  |m  =  a)dx.t.dm° 

=  Af(  S^*,;  pc,  Sc)  =  AT1  (S^*;  Vc ,  Ac) . 


Covariance  Form 

Vc = P m+^A 

=  Sm+  (Mt  +  /O 
Sc  =  Pm+S/lPm+ 

=  S^+EtSm+  -  S^+EtSm-  (S^_EtSm-)-1S^_EtSm+ 


Information  Form 


VC  ^171+ A/l^xt.m0  (^xt,m°  A^^x^m0  )  ^ xt,m0^ A 

~  Sm+  (r/t  —  T7Q)  —  Sm+  A*SXt  mo  (SXtjmoAiSXt>mo)  SXt>mo  (*7*  —  Tja) 
Ac  =  Pm-»-A>4Pm+  —  Pm+ A>*PXt  mo  (PIt  moAy\PXt  mo)  Pxtim°  A^Pm+ 

=  S^+A*Sm+  —  S^+  A*SXt|Tno  (Sjt>moA(SXtmo)  SjtjTnoAtSm+ 


(B.14) 


(B.15) 
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B.2.4  Calculation  of  p£>(m°,  m+,  m  ) 


This  distribution  is  obtained  by  marginalizing  out  the  robot  state,  x< ,  from  our  original 
posterior  p(xt,  m°,  m+,  m“): 


Pfl(m°,  m+,  m 


)  =  J p(xt,m°,m+,m  )dxt 

=  ^D^d)  =  N  1  (S^,o  m+>m-  £t\ Vd,  ^d)  ■ 


Covariance  Form 


=  S^0m+>m-StSm0im+im- 


(B16) 


Information  Form 


Vd  —  SmO  m+  m-^t  -  ^xt  (Sjt  AtSIt)  Sj(7/t 

=  Smom+m-  A(Smom+m-  —  Smo)m+jTn-  AtSI(  (SIt  A(SIt) 


-1CT 


(B.17) 


B.2.5  Calculation  of  pseif^, m°, m+, m  ) 


To  calculate  the  SEIF  sparsified  posterior  we  combine  the  three  distributions  pn  (B.12)- 
(B.13),  pc  (B.14)-(B.15),  and  pp  (B.16)-(B.17)  according  to  (B.9): 


PSEiF(xt,m°,m+,m  )  = 


Pe(xt, 


m '  m 


(  o  +  —  \ 

Pd(*h  . m  ,  m  ) 


pc  (m+|m_  =  a) 
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Covariance  Form 


PSEIF  (xt  ,m° ,  m+ ,  m  ) 

«  exp{-5(SJ,,m+^t  -  -  Mb) 

+I(S^+^-Mc)T^1(S^^-Mc) 

“5(Sn;°,m+,m-^  ~  Md)  Sp1  (Sm°,m+,m- Zt  ~  Md)  } 

=  exp  j-i(££  -  (nt  +  /ia))TSItim+Eg1Sjt  m+  (£t  -  (/it  +  Ma)) 
+  $(&  “  (Mt  +  Ma))TSm+S31S^+(^£  -  {Ht  +  flj) 

— j(^t  ”  Mt)  Smo)m+tTn-ED1S3,oim+<m-  (£t  -  Mt)  | 


defining  Ee  —  SXt  jn+EB  Sx^m+  —  Sm+Ec  Sm+  and  EF  —  Smojm+m-ED  Smo>m+  m- 
=  exp|-?Ut  -  (Mt  +  Ma))TSi1Ut  -  (Mt  +  Ma)) 

5  (^t  Mt)  V(€«-M*)} 

=  exp|  — 5  SF  £t  —  24 1  SF  (Mt  "I"  Ma)  (Mt  +  Ma)  ^E  (Mt  "I"  Ma) 

+£7 SFJ£t  —  2^7 SFVt  +  Mi  SFVt)  } 

=  exp  j— (E^1  +  EFx)4t  -  2*7  [E^(Mt  +  Ma)  +  S^Mt] 

+  [(Mt  +  Ma)TSi1(Mt  +  Ma)  +  m7 sBVt]  )  } 

=  exp|—  5  (€t~ (Efi1  +  E/r1)  [(S#1  +  E^.1)^  +  E^Ma] )  x 

(SB1  +  SB1)  (**  -  (SB1  +  SB1)-1  [(SB1  +  SBx)Mt  +  s bjm, 

=  exp|- 5 (*£  ~  [Mt+ (SB1  +  SB1)  SFVaj)  x 

(S^1  +  SFJ)  (^t  —  [mi  +  (S E  +  S Fl)  T,e  Ma  )  I 
=  exp|-i(^£  -  Mt)TSt_1  (*t  -  Mt)} 
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where 


Pt  —  Pt  +  Pa 

=  Ht  +  Et(SItm+EB1Sjtm+  —  Sm+  2(7*  S^+ )  na 

Et  =  (  E^  +  E^1)-1 

=  ^it,m+  —  Sm+  Sm+  +  Smo  m+  m-ED  Smo  m+ m- 


(B.18) 


Information  Form 


PSElF(xt,m°,m+,m  ) 

<x  exp|-i(Sjt  m+^)TAs(Sjt  m+4t)  +vl(Sxt ,m+St) 

+\(SJm^t)TAc(SJm+^)-vl(Sl^t) 

5  (^m°,m+,m~  ^t)  ^o(S^0im+  m-4t)  +  vJ)(Sm0,m+ ,m~  Zt)  | 

exp |  2 £t  ^xt,m+ ^B^xt,m+£t  "b  (^it.m+^s)  £t 

+ltJsm+\csl+tt-{sm+Tlc)Ttt 
=  expj-±^At£tT  +  Tj^t} 


where 


Vt  —  SXt yTn+T)B  ^m+Vc  ^m0 ,m+  ,m~  V D 

At  —  SXt  t77l+  A^Sa,f  m+  —  Sm+AcSm+  +  Smoim+>m- A/)Smo>m+ 


(B.19) 


B.3  Modified  Sparsification  Rule 


We  begin  by  writing  the  sparsified  posterior  approximation  as  a  ratio  of  the  three  individual 
distributions  pu,  Pv,  Pd  as  described  in  §3.3.2: 


PModRule  (xt  i  m° ,  m+ ,  m  ) 


pt/(xt,m+) 
pv(  m+) 


pp(m°,m+ 


,m 


■)• 


(B.20) 


Next,  we  calculate  the  individual  terms  of  (B.20)  by  marginalizing  and  conditioning  over 
our  original  distribution  p(xf,  ra°,  m+,  m~)  =  Af(£t\ Et)  =  A/"-1  (£*;  T7t,  At). 
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B.3.1  Calculation  of  pt/(xt,  m+) 

This  distribution  is  obtained  by  marginalizing  out  both  the  deactivated  features,  m°,  and 
the  passive  features,  m~,  from  our  original  posterior  p(xt,  m°,  m+,  m~): 


pi/(xt,m+)  =  JJ  p(xt, m°,m+,m  )dm°dm 

=  xt  ,m+ ;  Pu  >  Sl/  )  =  1  (S ,m+ it ;  Vu  >  A{/  )  • 


Covariance  Form 


l*U  ~  Sxt,m+^t 

^ 1/  =  Sj(>m+£tSItjm+ 


(B.21) 


Information  Form 

=  —  SItim+ AtSmom-  (Sm0m- AtSmom- )  Smo>m-Vt 

Ac/  ^art,Tn+ Sa.t(TJl+  AtSmo)Tn-  (Sm0(m-  A(Smoim— )  Smo_m- A(SIljTn+ 


(B.22) 


B.3.2  Calculation  of  pv(m+) 

This  distribution  is  obtained  by  marginalizing  out  all  terms  except  for  the  active  features, 
m+,  from  our  original  posterior  p(x(,  m°,  m+,  m“): 


pv(m+)  =  JJJ P(Xt'  m°,m+,m  )dxtdm°dm 

=  Af  (S^*;  SV)  =  A/'-1  (Sj+fr,  A„). 


Covariance  Form 


£V  =  S^+EfSm+ 


(B.23) 


Information  Form 

~  ®m+7It  —  ®m+  AtSIt)mom-  (Sl£  mom- A(.SI(imO)Tn- )  SXtfno  fn-Vt 

Av  Sm+AtSm+  Sm+ A(SIt  mojm-  (SXt)mo  m-  A(SItiJno  m-  )  ^xt>m0,m_A^m+ 


(B.24) 


B.3.3  Calculation  of  pModRuie(xt,  m°,  m+,  m  ) 

To  calculate  the  Modified-Rule  sparsified  posterior  we  combine  the  three  distributions  pu 
(B.21)-(B.22),  pv  (B.23)-(B.24),  and  pD  (B.16)-(B.17)  according  to  (B.20): 


PmodRule  (xt ,  m° ,  m+ ,  m  )  = 


P(y(xt,m+) 


PD(m°, 


m+ ,  m 


) 


pi/(m+) 
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Covariance  Form 


PModRule (xt,  m°,m+,m  ) 

cx  exp|-i(Sj(  m+£t  -  /*u)T£^1(Sjt,m+&  -  Mt/) 

msjm+st-nv)T^(si+zt-nv) 

~  PwV  (Sm°,m+,m-£t  “  Pw)  j 

=  exp|“5Ut  “  tXt)rSxt,m+'Zy1Slt  m+(4t  -  Ht) 
l*t)TSm+XylSl+(tt-  Ht) 

~ -  Pt)  SmoiTO+im-Evv,1S^oim+itn-  ((t  -  Mt)| 

=  exp{~5Ui  -  At)TSt_1Ut  -  At)  | 


where 


—  (SXt>m+Ejy  Sx^m+  S^E^  Sm+  +  Smo>m+m- E^  Smo tm+>m-) 


-loT 


, -1 


(B.25) 


Information  Form 

PMoDRuLE(xt,  m°,m+,  m_) 

«  «p{-l(si,m+^)TAi/(sj;im+o  +  u5(Sj,m+€t) 

+5(S^^t)TAv(S^^t)-^(S^^t) 

-5(SmO,m+,m-^t)  A^(Sm°,m+,m-^t)  +  (Sm° ,+  ,m- ^t)  | 

exp|— SXt)fn+At/SX|)m+^t  +  (SItiTn+T7t/)  £t 
+54irSm+AuS^+^f  -  (Sm+T7v/)T^t 

—  2^t  A/pSmo  m+iTn- +  (Smo  m+>m- TJp)  j" 

=  exp|-^7^tr  +  ^7^1 
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where 


Vt  S xtym+Vu  Sm+Vv  "t”  ®m° ,m+  ,m  V D 

—  Sj. tlm+A(/SXt  m+  —  Sm+AvSm+  +  Sm0t7Tl+  ,m~  ,m_ 


(B.26) 
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moving  light  source). 

Our  large-area  SLAM  algorithm  recursively  incorporates  relative-pose  constraints  using  a  view-based  representation  that  exploits  exact 
sparsity  in  the  Gaussian  canonical  form.  This  sparsity  allows  for  efficient  0(n)  update  complexity  in  the  number  of  images  composing 
the  view-based  map  by  utilizing  recent  multilevel  relaxation  techniques.  We  show  that  our  algorithmic  formulation  is  inherently  sparse 
unlike  other  feature-based  canonical  SLAM  algorithms,  which  impose  sparseness  via  pruning  approximations.  In  particular,  we 
investigate  the  sparsification  methodology  employed  by  SEIF  and  offer  new  insight  as  to  why,  and  how,  its  approximation  can  lead  to 
inconsistencies  in  the  estimated  state  errors.  Lastly,  we  present  a  novel  algorithm  for  efficiently  extracting  consistent  marginal 
covariances  useful  for  data  association  from  the  information  matrix. 
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